R Data Mining

Why to Choose R for Your Data Mining and Where to Start

Since this is our first step on the journey to R knowledge, we have to be sure to acquire all the tools and notions we will use on our trip. You are probably already an R enthusiast and would like to discover more about it, but maybe you are not so sure why you should invest time in learning it. Perhaps you lack confidence in defining its points of strength and weakness, and therefore you are not sure it is the right language to bet on. Crucially, you may not know where and how to practically begin your journey to R mastery. The good news is that you will not have to wait long to solve all of these issues, since this first chapter is all about them.

In particular, within this chapter we will:

Look at the history of R to understand where everything came from
Analyze R's points of strength, understanding why it is a savvy idea to learn this programming language
Learn how to install the R language on your computer and how to write and run R code
Gain an understanding of the R language and the foundation notions needed to start writing R scripts
Understand R's points of weakness and how to work around them

By the end of the chapter, we will have all the weapons needed to face our first real data mining problem.

What is R?

Let's start from the very beginning: what exactly is R? You will have read a lot about it on data analysis and data science blogs and websites, but perhaps you are still not able to fix the concept in your mind. R is a high-level programming language. This means that by running the kind of R scripts you are going to learn to write in this book, you will be able to order your PC to execute the desired computations and operations, resulting in some predefined output.

Programming languages are a set of predefined instructions that the computer is able to understand and react to, and R is one of them. You may have noticed that I referred to R as a high-level programming language. What does high-level mean? One way to understand it is by comparing it to typical industrial company structures. Within such companies, there is usually a CEO, senior managers, heads of departments, and so on, level by level until we reach the final group of workers.

What is the difference between those levels of a company hierarchy? The CEO makes the main strategical decisions, developing a strategical plan without taking care of tactical and operational details. From there, the lower you go in the hierarchy described, the more tactical and operational decisions become, until you reach the base worker, whose main duty is to execute basic operations, such as screwing and hammering.

It is the same for programming languages:

High-level programming languages are like the CEO; they abstract from operational details, stating high-level sentences which will then be translated by lower-level languages the computer is able to understand
Low-level programming languages are like the heads of departments and workers; they take sentences from higher-level languages and translate them into chunks of instructions needed to make the computer actually produce the output the CEO is looking for

To be precise, we should specify that it is also possible to directly write code using low-level programming languages. Nevertheless, since they tend to be more complex and wordy, their popularity has declined over time.

Now that we have a clear idea of what R is, let's move on and acquire a bit of knowledge about where R came from and when.

R's points of strength

You know that R is really popular, but why? R is not the only data analysis language out there, and neither is it the oldest one; so why is it so popular?

Looking at the root causes of R's popularity, we definitely have to mention these three:

Open source inside
Plugin ready
Data visualization friendly

Open source inside

One of the main reasons the adoption of R is spreading is its open source nature. The R source code is available for everyone to download, modify, and share back again (only in an open source way). Technically, R is released under the GNU General Public License, meaning that you can take it and use it for whatever purpose, but you have to distribute every derivative under the GNU General Public License as well.

These attributes fit well for almost every target user of a statistical analysis language:

Academic user: Knowledge sharing is a must for an academic environment, and having the ability to share work without the worry of copyright and license questions makes R very practical for academic research purposes
Business user: Companies are always worried about budget constraints; having professional statistical analysis software at their disposal for free sounds like a dream come true
Private user: This user merges together both of the benefits already mentioned, because they will find it great to have a free instrument with which to learn and share their own statistical analyses

Plugin ready

You could imagine the R language as an expandable board game. You know, games like 7 Wonders or Carcassonne, with a base set of characters and places and further optional places and characters, increasing the choices at your disposal and maximizing the fun. The R language can be compared to this kind of game.

There is a base version of R, containing a group of default packages that are delivered along with the standard version of the software (you can skip to the Installing R and writing R code section for more on how to obtain and install it). The functionalities available through the base version are mainly related to filesystem manipulation, statistical analysis, and data visualization.

While this base version is regularly maintained and updated by the R core team, virtually every R user can add new functionalities to those available within the base version by developing and sharing custom packages.

This is basically how the package development and sharing flow works:

The R user develops a new package, for example a package introducing a new machine learning algorithm exposed within a freshly published academic paper.
The user submits the package to the CRAN repository or a similar repository. The Comprehensive R Archive Network (CRAN) is the official repository for R-related documents and packages.

Every R user can gain access to the additional features introduced with any given package, installing and loading them into their R environment. If the package has been submitted to CRAN, installing and loading the package will result in running just the two following lines of R code (similar commands are available for alternative repositories such as Bioconductor):

install.packages("ggplot2")
library(ggplot2)

As you can see, this is a really convenient and effective way to expand R functionalities, and you will soon see how wide the range of functionalities added through additional packages developed by R users is.
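
Since a script is often run more than once, it can be convenient to install a package only when it is missing. The following is a minimal sketch of that idea, not an official workflow: the helper name ensure_package is hypothetical, and the sketch assumes a CRAN mirror is reachable whenever an installation is actually needed.

```r
# Hypothetical helper (not part of R): install a CRAN package only if it is
# missing, then report whether it is available.
ensure_package <- function(package_name) {
  # requireNamespace() returns TRUE when the package is already installed
  if (!requireNamespace(package_name, quietly = TRUE)) {
    install.packages(package_name)
  }
  requireNamespace(package_name, quietly = TRUE)
}

# "stats" ships with every base R installation, so nothing is downloaded here
ensure_package("stats")
```

Calling ensure_package("ggplot2") followed by library(ggplot2) would then have the same effect as the two lines shown earlier, without reinstalling the package on every run.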

More than 9,000 packages are available on CRAN, and this number is sure to increase further, making more and more additional features available to the R community.

Data visualization friendly

As a discipline, data visualization encompasses all of the principles and techniques that can be employed to effectively display the information and messages contained within a set of data.

Since we are living in an information-heavy age, the ability to effectively and concisely communicate articulated and complex messages through data visualization is a core asset for any professional. This is exactly why R is experiencing a great response in academic and professional fields: the data visualization capabilities of R place it at the cutting edge of these fields.

R has been noticed for its amazing data visualization features right from its beginning; when some of its peers still drew x axes by lining up + signs, R was already able to produce astonishing 3D plots. Nevertheless, a major improvement of R as a data visualization tool came when Auckland's Hadley Wickham developed the famous ggplot2 package based on The Grammar of Graphics, introducing into the R world an organic framework for data visualization tasks.

This package alone introduced the R community to a highly flexible way of producing and visualizing almost every kind of data visualization, having also been designed as an expandable tool, in order to add the possibility of incorporating new data visualization techniques as soon as they emerge. Finally, ggplot2 gives you the ability to highly customize your plot, adding every kind of graphical or textual annotation to it.

Nowadays, R is being used by the biggest tech companies, such as Facebook and Google, and by widely circulated publications such as the Economist and the New York Times to visualize their data and convey their information to their stakeholders and readers.

To sum all this up—should you invest your precious time learning R? If you are a professional or a student who could gain advantages from knowing effective and cutting-edge techniques to manipulate, model, and present data, I can only give you a positive opinion: yes. You should definitely learn R, and consider it a long-term investment, since the points of strength we have seen place it in a great position to further expand its influence in the coming years in every industry and academic field.

Installing R and writing R code

Now that you know why it is worth learning R as a language for data analysis, let's have a look at how to get up and running with R coding. First of all, let's have a bit of clarity—installing R is different from installing an integrated platform on which to write and run R code. Here, you will learn both of these and the differences between them.

Downloading R

Installing R means installing the R language interpreter on your computer. This will teach your computer how to execute R commands and R scripts, marked with the .R file extension. The most up-to-date release of the R language is hosted on the official R project server, reachable at https://cran.r-project.org.

Once you have surfed the website, you will have to locate the proper download link, that is, the link to the R version appropriate for your platform. You will have these three choices:

Download R for Linux (https://cran.r-project.org/bin/linux/)
Download R for macOS (https://cran.r-project.org/bin/macosx/)
Download R for Windows (https://cran.r-project.org/bin/windows/)

R installation for Windows and macOS

For macOS and Windows, you will follow a similar workflow:

Download the files bundle you will be pointed to from the platform-related page.
Within the bundle, locate the appropriate installer:
- The one for Windows will be named something like R-3.3.2-win.exe
- The one for macOS will be similar to R-3.3.2.pkg
Execute that installer and wait for the installation process to complete.

Once you are done with this procedure, R will be installed on your platform and you will be ready to employ it. If you are a Linux user, things will look a little different.

R installation for Linux OS

The most convenient choice, if you are a Linux user, is to install the R base version directly from your command line. This is a straightforward procedure; on a Debian-based distribution such as Ubuntu, it only requires you to run the following commands on your Terminal (other distributions use their own package manager in place of apt-get):

sudo apt-get update
sudo apt-get install r-base

This will likely result in the Terminal asking you for your machine administrator password, which is strictly required to perform commands as a superuser (that is what sudo stands for).

Main components of a base R installation

You may be wondering what you get with the installation you just performed, and that is what we are going to look at here. First of all, the base R version comes with a proper interpreter of the most updated version of the R software. This means, if you recall what we learned in the What is R? section, that after performing your installation, the computer will be able to read R code, parse it, and execute instructions composed of parsed code. To get a feel for this, try the following code on your OS command line, choosing the appropriate one:

On Windows OS (on PowerShell):

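# Write the script as plain ASCII: in Windows PowerShell the >> redirection can produce a UTF-16 file, which Rscript cannot parse
Set-Content -Path new_script.R -Value "print('hello world')" -Encoding ASCII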
Rscript.exe new_script.R

On macOS or Linux OS:

R
print('hello world')

Both of these should result in the evergreen 'hello world' output.

Apart from the interpreter, the R language base version also comes packed with a very basic platform for the development and execution of R code, which is mainly composed of:

An R console to execute R code and observe the results of the execution
An R script text editor to write down the R code and subsequently save it as standalone scripts (the ones with the .R file extension)
Additional utilities, such as functions to import data, install additional packages, and navigate your console history

This was the way R code was produced and consumed by the vast majority of the R community for a long time. Nowadays, even though it runs perfectly and is regularly updated, this platform tends to appear one step behind the available alternatives we are going to explore in the next section.

Possible alternatives to write and run R code

We have already discussed two ways of executing R code:

Employing your OS terminal
Employing the development environment that comes with the R base installation

The first of the aforementioned ways can be quite a convenient way for experienced R users. It clearly shows its advantages when executing articulated analytical activities, such as ones requiring:

The sequential execution of scripts from different languages
The execution of filesystem manipulation

Regarding the second alternative, we have already talked about its shortfalls compared to its direct competitors. Therefore, now is the time to have a closer look at these competitors, and this is what we are going to do in the following paragraphs before actually starting to write some more R code.

Two disclaimers are needed:

We are not considering text editor applications here, that is, software that includes neither an R console nor additional code execution utilities. Rather, we prefer integrated development environments (IDEs), since they are able to provide a more user-friendly and comprehensive experience for a new language adopter.
We are not looking for completeness here, just for the tools most often cited within R community discussions and events. Perhaps something better than these platforms is available, but it has not yet gained comparable momentum.

The alternative platforms we are going to introduce here are:

RStudio
Jupyter Notebook
Visual Studio

RStudio (all OSs)

RStudio is a really well-known IDE within the R community. It is freely available at https://www.rstudio.com. The main reason for its popularity is probably the R-dedicated nature of the platform, which differentiates it from the other two alternatives that we will discuss further, and its perfect integration with some of the most beloved packages of the R community.

RStudio comes packed with all the base features we talked about when discovering the R base installation development environment, enriched with a ton of useful additional components introduced to facilitate coding activity and maximize the effectiveness of the development process. Among those, we should point out:

A filesystem browser to explore and interact with the content of the directory you are working with
A file import wizard to facilitate the import of datasets
A plot pane to visualize and interact with the data visualization produced by code execution
An environment explorer to visualize and interact with values and the data produced by code execution
A spreadsheet-like data viewer to visualize the datasets produced by code execution

All of this is enhanced by features such as code autocompletion, inline help for functions, and splittable windows for multi-monitor users, as seen in the following screenshot:

A final word has to be said about integration with the most beloved R additional packages. RStudio comes with additional controls or predefined shortcuts to fully integrate, for instance:

rmarkdown package for R Markdown integration with R code (more on this in Chapter 13, Sharing your stories with your stakeholders through R markdown)
dplyr for data manipulation (more on this in Chapter 2, A First Primer on Data Mining - Analysing Your Banking Account Data)
shiny package for web application development with R (more on this in Chapter 13, Sharing your stories with your stakeholders through R markdown)

The Jupyter Notebook (all OSs)

The Jupyter Notebook was primarily born as a Python extension to enable interactive data analysis and a fully reproducible workflow. The idea behind the Jupyter Notebook is to have both the code and the output of the code (plots and tables) within the same document. This allows both the developer and other subsequent readers, for instance a customer, to follow the logical flow of the analysis and gradually arrive at the results.

Compared to RStudio, Jupyter does not have a filesystem browser, nor an environment browser. Nevertheless, it is a very good alternative, especially when working on analyses which need to be shared.

Since it comes originally as a Python extension, it is actually developed with the Python language. This means that you will need to install Python as well as R to execute this application. Instructions on how to install Jupyter can be found in the Jupyter documentation at http://jupyter.readthedocs.io/en/latest/install.html.

After installing Jupyter, you will need to add a specific component, namely a kernel, to execute R code on the notebook. Instructions on how to install the kernel can be found on the component's home page at https://irkernel.github.io.

Visual Studio (Windows users only)

Visual Studio is a popular development tool, primarily for Visual Basic and C++ language development. Due to the recent interest shown by Microsoft in the R language, this IDE has been expanded through the introduction of the R Tools extension.

This extension adds all of the commonly expected features of an R IDE to a well-established platform such as Visual Studio. The main limitation at the moment is the availability of the product, as it is only available on computers running the Windows OS.

Also, Visual Studio is available for free, at least the Visual Studio Community Edition. Further details and installation guides are available at https://www.visualstudio.com/vs/rtvs.

R foundational notions

Now that you have installed R and your chosen R development environment, it is time to try them out, acquiring some foundations of the R language. Here, we are going to cover the main building blocks we will use along our journey to build and apply the data mining algorithms this book is all about. More specifically, after warming up a bit by performing basic operations on the interactive console and saving our first R script, we are going to learn how to create and handle:

Vectors, which are ordered sequences of values, or even just one value
Lists, which are defined as a collection of vectors and of every other type of object available in R
Data frames, which can be seen as lists composed of vectors, all with the same number of values
Functions, which are sets of instructions performed by the language that can be applied to vectors, lists, and data frames to manipulate them and gain new information from them

Finally, we will look at how to define custom functions and how to install additional packages to extend R language functionalities. If you feel overwhelmed by this list of unknown entities, I would like to assure you that we are going to get really familiar with all of them within a few pages.

A preliminary R session

Before getting to know the alphabet of our powerful language, we need to understand the basics of how to employ it. We are going to:

Perform some basic operations on the R console
Save our first R script
Execute our script from the console

Executing R interactively through the R console

Once you have opened your favourite IDE (we are going to use RStudio), you should find an interactive console, which you can recognize by its blinking cursor. Once you have located it, just try to perform a basic operation by typing the following expression and pressing Enter, submitting the command to the console:

2+2

A new line will automatically appear, showing you the following unsurprising result:

[1] 4

Yes, just to reassure you, we are going to discuss more sophisticated mathematical computations; this was just an introductory example.

What I would like to stress with this is that within the console, you can interactively test small chunks of code. What is the disadvantage here? When you terminate your R session (shutting down your IDE), everything that you performed within the console will be lost. There are actually IDEs, such as RStudio, that store your console history, but that is intended as an audit trail rather than as a proper way to store your code.

In the next paragraph, we are going to see the proper way to store your code. In the meantime, for the sake of completeness, let me clarify for you that the R language can perform all the basic mathematical operations, employing the following operators: +, -, *, /, ^, the last of which is employed when raising to a power.
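
As a quick illustration of these operators, the following lines can be typed one at a time into the console. The numbers are arbitrary, and the two lines on parentheses are an addition of ours, included only to show that the usual order of evaluation applies.

```r
7 + 3   # addition: 10
7 - 3   # subtraction: 4
7 * 3   # multiplication: 21
7 / 2   # division: 3.5
2 ^ 3   # raising to a power: 8

# Parentheses control the order of evaluation, as in ordinary arithmetic
(2 + 2) * 5   # 20
2 + 2 * 5     # 12
```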

Creating an R script

An R script is a plain text document storing a large or small chunk of R code. The advantage of the script is that it can store and show a structured set of instructions that can be executed again at any time or recalled from outside the script itself (see the next paragraph for more on this). Within your IDE, you will find a New script control that, if selected, will result in a new file with the .R extension coming up, ready to be filled with R code. If there is no similar control within the IDE you chose, first of all, you should seriously think about looking for another IDE, and then you can deal with the emergency by running the following command within the R console:

file.create("my_first_script.R")

Let's start writing some code within our script. Since there is a long tradition to be respected, we are going to test our script with the well-known, useless statement, "hello world". To obtain those two amazing words as an output, you just have to tell R to print them out. How is that done? Here we are:

print("hello world")

Once again, for the reader afraid of having wasted their money on this book, we are going to deal with more difficult topics; we are just warming up here.

Before moving on, let's add one more line, not in the form of a command, but as a comment:

# my dear interpreter, please do not execute this line, it is just a comment

Comments are actually a really relevant piece of software development. As you might guess, such lines are not executed by the interpreter, which is programmed to ignore everything from the # token to the end of the line. Nevertheless, comments are a precious friend of the programmer, and an even more precious friend of the same programmer one month after having written the script, and of any other reader of the given code. These pieces of text are employed to mark the rationales, assumptions, and objectives of the code, in order to make clear what the scope of the script is, why certain manipulations were performed, and what kind of assumptions are to be satisfied to ensure the script is working properly.

One final note on comments—you can put them inline with some other code, as in the following example:

print("hello world") # dear interpreter, please do not execute this comment

It is now time to save your file, which just requires you to find the Save control within your IDE. When a name is required, just name it my_first_script.R, since we are going to use it in a few moments.

Executing an R script

The further you get with your coding expertise, the more probable it is that you will find yourself storing different parts of your analyses in separate scripts, calling them in a sequence from the terminal or directly from a main script. It is therefore crucial to learn how to correctly perform this kind of operation from the very beginning of our learning path. Moreover, executing a script from the beginning to the end is a really good method for detecting errors, that is, bugs, within your code. Finally, storing your analyses within scripts will help make them reproducible for other interested people, which is a really desirable property able to strengthen the validity of your results.

Let's try to execute the script we previously created. To execute a script from within R, we use the source() function. As we will see in more depth later, a function is a set of instructions which usually takes one or more inputs and produces an output. Each input is called an argument, while the output is called a value. In this case, we are going to specify just one argument, the file argument. As you may have guessed, the value passed to this argument will be the name of the R script we saved before. With all that mentioned, here is the command to submit:

source("my_first_script.R")

What happens when this command is run? You can imagine the interpreter reading the line of code and thinking the following: OK, let's have a look at what is inside this my_first_script file. Nice, here's another R command: print("hello world"). Let's run it and see what happens! Apart from the fictional tone, this is exactly what happens. The interpreter looks for the file you pointed to, reads the contents of the file, and executes the R commands stored in it. Our example will result in the console producing the following output:

[1] "hello world"
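
To try this without depending on where my_first_script.R was saved, here is a small self-contained sketch. It writes a one-line script to a temporary file (the script_path variable is just an illustrative name) and then runs it with source(); the echo argument is an optional extra that also displays each command as it is executed.

```r
# Write a one-line script to a temporary file so the example is self-contained
script_path <- tempfile(fileext = ".R")
writeLines('print("hello world")', con = script_path)

# Run every command stored in the script
source(file = script_path)

# echo = TRUE additionally shows each command before its output
source(file = script_path, echo = TRUE)
```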

It is now time to actually learn the R alphabet, starting with vectors.

Vectors

What are vectors and where do we use them? The term vector is directly derived from the algebra field, but we shouldn't take the analogy too much further than that since within the R world, we can simply consider a vector to be an ordered sequence of values of the same data type. Saying that a sequence is ordered means that two sequences holding the same values in a different order, such as 1, 2, 3 and 3, 2, 1, are treated as two different entities by R.

How do you create a vector in R? A vector is created through the c() function, as in the following statement:

c(100,20,40,15,90)

Even though this is a regular vector, it will disappear as soon as it is printed out by the console. If you want to store it in your R environment, you should assign it a name, that is, you should create a variable. This is easily done with the assignment operator, <-:

vector <- c(100,20,40,15,90)

As soon as you run this command, your environment will be enriched by a new object of type vector. This is fine, but what is the practical usage of vectors? Almost every input and output produced by R can be reduced to a vector, meaning it represents the foundation for every development of this language. Within this book, for instance, we are going to store the results of statistical tests performed on our data in vectors, and create a vector representing a probability distribution we want our model to respect.
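
The following short sketch shows what storing a vector gives us in practice. The variable names are illustrative, and the square bracket notation used to read one value by position goes slightly beyond what has been introduced so far.

```r
scores           <- c(100, 20, 40, 15, 90)
reordered_scores <- c(90, 15, 40, 20, 100)

length(scores)                        # number of values: 5
identical(scores, reordered_scores)   # FALSE: same values, different order
scores[2]                             # value in the second position: 20
```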

A final relevant note on vectors—so far, we have seen only a numerical vector, but you should be aware that it is possible to define all of the following types of vectors:

Type	Example
numeric	1
logical / Boolean	TRUE
character	"text here"

Moreover, it is possible to define mixed content vectors:

mixed_vector <- c( 1, TRUE, "text here")

To be exact, these kinds of vectors will in the end be coerced to a vector of the single type that can represent all of their values, character in our example, but I do not want to confuse you with too many details.

So, now we know how to create a vector and what to store within it, but how do we recall it and show its content? As a general rule, recalling an object will simply require you to write down its name. So, to show the mixed_vector we just created, it will be sufficient to write down its name within the R console and submit this minimal command. The result will be the following:

[1] "1"         "TRUE"      "text here"
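
For the curious reader, this conversion can be observed directly with the class() function, which reports the type of a vector. The snippet below is only a sketch with arbitrary values.

```r
class(c(1, 2, 3))               # "numeric"
class(c(TRUE, FALSE))           # "logical"
class(c("a", "b"))              # "character"

# Mixing types triggers a conversion to the most general type involved
class(c(1, TRUE))               # "numeric": TRUE becomes 1
class(c(1, TRUE, "text here"))  # "character"
```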

Lists

Now that you know what vectors are, you can easily understand what lists are: containers of objects. This is actually an oversimplification of lists, since they can also contain other lists, or even data frames inside them. Nevertheless, the relevant concept here is that lists are a convenient way to store objects within the R environment. For instance, they are used by a lot of statistical functions to store the results of their applications.

Let's show this to you practically:

regression_results <- lm(formula = Sepal.Length ~ Species, data = iris)

Without getting into regression details too much (which will be done in a few chapters), it will be sufficient here to explain that we are fitting a regression model on the iris dataset, trying to explain the length of the sepals through the species of iris flower they belong to. The iris dataset is a really famous preloaded data frame included with every R base version.

Let's now have a look at this regression_results object that, as we were saying, stores the results of the regression model fitting. To find the kind of any given object, we can run the mode() function on it, passing the name of the object as a value for the argument x:

mode(x = regression_results)

This will result in:

[1] "list"
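
Since the object is a list, its components can be inspected like those of any other list. The sketch below uses names() and the $ operator, neither of which has been formally introduced yet; note also that class(), unlike mode(), would report "lm" for this object.

```r
regression_results <- lm(formula = Sepal.Length ~ Species, data = iris)

is.list(regression_results)   # TRUE
names(regression_results)     # names of the components stored in the list

# The $ operator retrieves one component by name
regression_results$coefficients
```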

Creating lists

Let's move one step back; how do we generally create lists? Here, we always use the assignment operator <-, the one we met when dealing with vectors. What is going to be different here is the function applied. It will no longer be c(), but a reasonably named list(). For instance, let's try to create two vectors and then merge them into a list:

first_vector  <- c("a","b","c")
second_vector <- c(1,2,3)
vector_list   <- list(first_vector, second_vector)

Subsetting lists

What if we would now like to isolate a specific object within a list? We have to employ the [[]] operator, specifying the position of the object we would like to expose. For instance, if we would like to extract only the second vector from vector_list, this would be the code:

vector_list[[2]]

Which will result in:

[1] 1 2 3

You may be wondering, is it possible to expose a single element within a single object composing a list? The answer is yes, so let's assume that we now want to isolate the third element of the second_vector object, which is the second object composing the vector_list list. We will have to employ the [[]] operator once again:

vector_list[[2]][[3]]

Which will have the expected output:

[1] 3
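
A detail worth seeing in action is the difference between double and single square brackets, together with the option of naming the components of a list. The following is a minimal sketch; the component names letters_part and numbers_part are made up for illustration.

```r
first_vector  <- c("a", "b", "c")
second_vector <- c(1, 2, 3)
vector_list   <- list(first_vector, second_vector)

class(vector_list[[1]])   # "character": the first vector itself
class(vector_list[1])     # "list": a list of length one wrapping that vector

# Naming the components allows them to be recalled by name
named_list <- list(letters_part = first_vector, numbers_part = second_vector)
named_list$numbers_part           # same as named_list[[2]]
named_list[["numbers_part"]][3]   # third element of the second component
```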

Data frames

Data frames can be seen simply as lists respecting the following requisites:

All components are vectors, no matter whether logical, numerical, or character (even mixed vectors are allowed)
All vectors must be of the same length

From the mentioned rules, we can derive that data frames can be imagined, and commonly are, as tables having a certain number of columns, represented by the vectors composing them, and a certain number of rows, which will coincide with the length of the vectors. While the two rules are always to be respected, no limitation is placed on the possibility of having columns of different types, such as numerical and Boolean.

As you can imagine, data frames are a really convenient way to store data, especially sets of structured data, such as experimental observations or financial transactions. As we will come to better understand in the following chapters, a data frame lets us store an observation within each row and an attribute of any given observation within each column.

Even though data frames are a logical subgroup of lists, they have a full pack of tailored functions for their creation and handling.

Creating a data frame closely resembles the creation of a list, except for the different name of the function, which is once again named in a convenient way as data.frame():

a_data_frame <- data.frame(first_attribute = c("alpha","beta","gamma"), second_attribute = c(14,20,11))

Please note that every vector, that is, every column, is named by the text token preceding the = operator. There are two relevant observations on this:

Omitting the name of the vector will result in an ugly and rather unfriendly automatically assigned name, which in this case would have been c..alpha....beta....gamma.. for the first column and c.14..20..11. for the second column. This is why it is strongly recommended to add column names.
It is also possible to give column names composed of spaced values, such as first attribute rather than first_attribute. To do so, we need to surround our column name with double quotes and, because data.frame() otherwise converts the name to first.attribute, also pass check.names = FALSE:

a_data_frame <- data.frame("first attribute" ...)

To be honest, I would definitely discourage you from going for the second alternative because of the annoying consequences it would create when trying to recall it in the subsequent pieces of code.
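The naming behavior described above can be checked with a short sketch. The object names unnamed, converted, and spaced are hypothetical and introduced here only for illustration.

```r
# Unnamed columns: R builds the names from the code used to create each column
unnamed <- data.frame(c("alpha", "beta", "gamma"), c(14, 20, 11))
names(unnamed)

# A quoted name containing a space is converted to a valid name by default
converted <- data.frame("first attribute" = c("alpha", "beta", "gamma"))
names(converted)

# check.names = FALSE keeps the space, at the cost of backticks on every use
spaced <- data.frame("first attribute" = c("alpha", "beta", "gamma"),
                     check.names = FALSE)
names(spaced)
spaced$`first attribute`
```

The need for backticks every time the column is recalled is one of the annoying consequences mentioned above.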

How do we select and show a column of a data frame? We employ the $ operator here:

a_data_frame$second_attribute
[1] 14 20 11

We can add new columns to the data frame in a similar way:

a_data_frame$third_attribute <- c(TRUE,FALSE,FALSE)
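To tie the two rules stated at the beginning of this section to actual code, here is a small sketch. It rebuilds a_data_frame as defined above; the mismatch object and its x and y columns are hypothetical names used only for this illustration.

```r
a_data_frame <- data.frame(first_attribute = c("alpha", "beta", "gamma"),
                           second_attribute = c(14, 20, 11))

# A data frame is a list whose components are the columns
is.list(a_data_frame)
length(a_data_frame)   # number of columns
nrow(a_data_frame)     # number of rows, that is, the length of each vector

# Being a list, its columns can also be reached with the [[ ]] operator
identical(a_data_frame$second_attribute, a_data_frame[["second_attribute"]])

# Adding a column with $ works when its length matches the number of rows
a_data_frame$third_attribute <- c(TRUE, FALSE, FALSE)
ncol(a_data_frame)

# Vectors of different lengths break the second rule and raise an error
# (lengths 3 and 2 are used because data.frame() recycles a shorter vector
# only when it fits a whole number of times)
mismatch <- tryCatch(
  data.frame(x = c(1, 2, 3), y = c(1, 2)),
  error = function(e) "error"
)
mismatch
```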

Functions

If we would like to put it simply, we could just say that functions are ways of manipulating vectors, lists, and data frames. This is perhaps not the most rigorous definition of a function; nevertheless, it catches a focal point of this entity—a function takes some inputs, which are vectors (even of one element), lists, or data frames, and results in one output, which is usually a vector, a list, or a data frame.

The exceptions here are functions that perform filesystem manipulation or some other specific tasks, which in some other languages are called procedures; one example is the file.create() function we encountered before.

One of the most appreciated features of R is the possibility to easily explore the definition of all the functions available. This is easily done by submitting a command with the sole name of the function, without any parentheses. Let's try this with the mode() function and see what happens:

mode

function (x)
{
    if (is.expression(x))
        return("expression")
    if (is.call(x))
        return(switch(deparse(x[[1L]])[1L], `(` = "(", "call"))
    if (is.name(x))
        "name"
    else switch(tx <- typeof(x), double = , integer = "numeric",
        closure = , builtin = , special = "function", tx)
}
<bytecode: 0x102264c98>
<environment: namespace:base>

We are not going to get into detail with this function, but let's just notice some structural elements:

We have a call to function(), which, by the way, is a function itself.
We have the specification of the only argument of the mode function, which is x.
We have braces surrounding everything coming after the function() call. This is the body of the function and contains all the calculations/computations performed by the function on its inputs.

Those are the actual, minimal elements for the definition of a function within the R language. We can summarize this as follows:

function_name <- function(arguments){
    [function body]
}
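As a hedged aside, the structural elements listed above can also be inspected programmatically with the base functions formals() and body(). The halve() function below is a hypothetical example introduced only for this sketch.

```r
# formals() returns the arguments: mode() has a single one, named x
names(formals(mode))

# body() returns the code between the braces as an R object
function_body <- body(mode)
class(function_body)

# The same accessors work on functions we define ourselves
halve <- function(x) {
  x / 2
}
names(formals(halve))
body(halve)
halve(10)
```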

Now that we know the theory, let's try to define a simple and useless function that adds 2 to every number submitted:

adding_two <- function(the_number){
  the_number + 2
}

Does it work? Of course it does. To test it, we first have to execute the lines of code stating the function definition, and then we will be able to employ our custom function:

adding_two( the_number = 4)
[1] 6
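Since the function is meant to add 2 to every number submitted, a short sketch can show it working on a whole vector as well as on a single number. Nothing here goes beyond the adding_two() definition given above.

```r
adding_two <- function(the_number) {
  the_number + 2
}

# The argument can be passed by name or by position
adding_two(the_number = 4)
adding_two(4)

# Since + is vectorized, the same function handles a whole vector at once
adding_two(c(1, 2, 3))
# [1] 3 4 5
```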

Now, let's introduce a slightly more complicated but relevant concept: value assignment within a function. Let's imagine that you are writing a function and storing its result within a function_result vector. You would probably write something like this:

my_func <- function(x){
  function_result <- x / 2
}

You may even think that, after running your function, for instance with x equal to 4, you should find an object function_result equal to 2 (4/2) within your environment.

So, let's try to print it out in the way that we learned some paragraphs earlier:

function_result

This is what happens:

Error: object 'function_result' not found

How is this possible? This is actually because of the rules overseeing the assignment of values within a function. We can summarize those rules as follows:

A function can look up a variable, even if defined outside the function itself
Variables defined within the function remain within the function
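The two rules can be sketched in a few lines. All the names used here (outside_value, read_outside, create_inside, inside_value) are hypothetical and serve only this illustration; the sketch assumes no object called inside_value already exists in your session.

```r
# Rule 1: a function can read a variable defined outside of it
outside_value <- 10
read_outside <- function() {
  outside_value * 2
}
read_outside()

# Rule 2: a variable created inside a function stays inside it
create_inside <- function(x) {
  inside_value <- x / 2
  inside_value
}
create_inside(4)

# The local variable is not visible once the function has finished
exists("inside_value")
```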

How is it therefore possible to export the function_result object outside the function? You have two possible ways:

Employing the <<- operator, the so-called superassignment operator
Employing the assign() function

Here is the function rewritten to employ the superassignment operator:

my_func <- function(x){
  function_result <<- x / 2
}

If you try to run it, you will now find that the function_result object shows up within your environment browser. One last step: exporting an object created within a function is different from returning that object as the result of the function. Let's show this practically:

my_func <- function(x){
  function_result <- x / 2
  function_result
}

If you now try to run my_func(4) once again, your console will print out the result:

[1] 2

But within your environment, once again, you will not find the function_result object. Why is this? Within the function definition, you specified the value of the function_result object as the final result, that is, the value returned by the function. Nevertheless, as in the first formulation, the object itself was defined with the standard assignment operator, so it remains local to the function.
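The assign() alternative mentioned earlier was not shown, so here is a minimal sketch of it next to the returning variant. The function names my_func_assign and my_func_return, and the stored_result object, are hypothetical; targeting .GlobalEnv is an assumption about where you want the object to land.

```r
# Variant using assign(): the object is created in the global environment
my_func_assign <- function(x) {
  assign("function_result", x / 2, envir = .GlobalEnv)
}
my_func_assign(4)
function_result

# Variant returning the value: nothing is exported, and the caller
# decides where to store the result
my_func_return <- function(x) {
  function_result <- x / 2
  function_result
}
stored_result <- my_func_return(10)
stored_result

# The global object still holds 2: my_func_return() did not touch it
function_result
```

Returning the value and assigning it at the call site is usually easier to follow than exporting objects from inside a function, since the effect of each call stays visible in the calling code.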

R's weaknesses and how to overcome them

When talking about R with an experienced tech professional, you will probably hear two main objections to the language:

Its steep learning curve
Its difficulty in handling large datasets

You will soon discover that those are actually the two main weaknesses of the language. Nevertheless, without pretending that R is a perfect language, we are going to tackle those weaknesses here, showing effective ways to overcome them. We can actually consider the first of the mentioned objections temporary, at least on an individual basis, since once users get through the valley of despair, they never go back to it and the weakness is forgotten. You do not know about the valley of despair? Let me show you a plot, and then we can discuss it:

It is common wisdom that everyone who starts to learn something new and sufficiently complex will go through three different phases:

The honeymoon, where they fall in love with the new subject and feel confident that they can easily master it
The valley of despair, where everything starts looking impossible and disappointing
The rest of the story, where they start having a more realistic view of the new topic, their mastery of it starts increasing, and so does their level of confidence

Moving on to the second weakness, we have to say that R's difficulty in handling large datasets is a more structural aspect of the language, and therefore requires some structural changes to the language and strategic cooperation between it and other tools. In the next two sections, we will go through both of the aforementioned weaknesses.

Learning R effectively and minimizing the effort

First of all, why is R perceived as a language that is difficult to learn? We don't have a universally accepted answer to this question. Nevertheless, we can try some reasoning on it. R is the main choice when talking about statistical data analysis and was indeed born as a language by statisticians for statisticians, and specifically for statistics students. This produced two specific features of the language:

No great care for the coding experience
A previously unseen range of statistical techniques applicable with the language, with an unprecedented level of interaction

Here, we can find reasons for the perceived steep learning curve: R wasn't conceived as a coder-friendly language, as, for instance, Julia and Swift were. Rather, it was an instrument born within the academic field for academic purposes, as we mentioned before. R's creators probably never expected their language to be employed for website development, as is the case today (you can refer to Chapter 13, Sharing your stories with your stakeholders through R markdown; take a look at the Shiny apps on this).

The second point is the feeling of disorientation that affects people, including statisticians, coming to R from other statistical analysis languages. Applying a statistical model to your data through R is an amazingly interactive process, where you get your data into a model, get results, and perform diagnostics on it. Then, you iterate once again or perform cross-validation techniques, all with a really high level of flexibility. This is not exactly what a SAS or SPSS user is used to. Within these two languages, you just take your data, send it to a function, and wait for a comprehensive and seemingly endless set of results.

Is this the end of the story? Do we need to passively accept this history-rooted steep learning curve? Of course we don't, and the R community is actually actively involved in the task of leveling this curve, following two main paths:

Improving the R coding experience
Developing high-quality learning materials

The tidyverse

Because it is so widespread throughout the R community, it is almost impossible nowadays to talk about R without mentioning the so-called tidyverse. This original name stands for a framework of concepts and functions developed mainly by Hadley Wickham to bring R closer to a modern programming experience. Introducing you to the magical world of the tidyverse is out of the scope of this book, but I would like to briefly explain how the framework is composed. At least the following four packages are usually included within the tidyverse:

readr: For data import
dplyr: For data manipulation
tidyr: For data cleaning
ggplot2: For data visualization

Due to its great success, an ever-increasing amount of learning material has been created on this topic, and this leads us to the next paragraph.

Leveraging the R community to learn R

One of the most exciting aspects of the R world is the vital community surrounding it. In the beginning, the community was mainly composed of statisticians and academics who encountered this powerful tool in the course of their studies. Nowadays, while statisticians and academics are still in the game, the R community also includes a great variety of professionals from different fields, from finance to chemistry and genetics. It is commonly acknowledged that its community is one of the R language's distinctive traits. This community is also a great asset for every newcomer to the language, since it is composed of people who are generally friendly, rather than posh, and open to helping you with your first steps in the language. I guess this is, generally speaking, good news, but you may be wondering: how do I actually leverage this amazing community you are introducing me to? First of all, let us find them, looking at places, both virtual and physical, where you can experience the community. We will then look at practical ways to leverage community-driven content to learn R.

Where to find the R community

There are different places, both physical and virtual, where it is possible to communicate with the R community. The following is a tentative list to get you up and running:

Virtual places:

R-bloggers
Twitter hashtag #rstats
Google+ community
Stack Overflow R tagged questions
R-help mailing list

Physical places:

The annual R conference
The RStudio developer conference
The R meetup

Engaging with the community to learn R

Now that we know where to find the community, let's take a closer look at how to take advantage of it. We can distinguish three alternative and non-exclusive ways:

Employing community-driven learning material
Asking for help from the community
Staying ahead of language developments

Employing community-driven learning material: There are two main kinds of R learning materials developed by the community:

Papers, manuals, and books
Online interactive courses

Papers, manuals, and books: This first kind is certainly the more traditional one, but you shouldn't neglect it, since these learning materials are able to give you a more organic and systematic understanding of the topics they treat. You can find a lot of free material online in the form of papers, manuals, and books.

Let me point out the most useful ones:

Advanced R
R for Data Science
An Introduction to Statistical Learning
OpenIntro Statistics
The R Journal

Online interactive courses: This is probably the most common learning material nowadays. You can find different platforms delivering good content on the R language, the most famous of which are probably DataCamp, Udemy, and Packt itself. What all of them share is a practical and interactive approach that lets you learn the topic directly, applying it through exercises rather than passively looking at someone explaining theoretical stuff.

Asking for help from the community: As soon as you start writing your first lines of R code, and perhaps before you even actually start writing it, you will come up with some questions related to your work. The best thing you can do when this happens is to resort to the community to solve those questions. You will probably not be the first one to come up with that question, and you should therefore first of all look online for previous answers to your question.

Where should you look for answers? You can look everywhere, but most of the time you will find the answer you are looking for on one of the following (listed by the probability of finding the answer there):

Stack Overflow
R-help mailing list
R packages documentation

I wouldn't suggest you look for answers on Twitter, G+, and similar networks, since they were not conceived to handle these kinds of processes, and you will expose yourself to the peril of reading answers that are out of date, or simply incorrect, because no review system is in place.

If you happen to be asking an innovative question never previously asked by anyone, first of all, congratulations! In that happy circumstance, you can ask your question in the same places where you previously looked for answers.

Staying ahead of language developments: The R language landscape is constantly changing, thanks to the contributions of many enthusiastic users who take it a step further every day. How can you stay ahead of those changes? This is where social networks come in handy. Following the #rstats hashtag on Twitter, Google+ groups, and similar places will give you the pulse of the language. Moreover, you will find the R-bloggers aggregator really useful, since it delivers a daily newsletter made up of the R-related blog posts published the previous day. Finally, annual R conferences and similar occasions constitute a great opportunity to get in touch with the most renowned R experts, gaining from them useful insights and inspiring speeches about the future of the language.

Handling large datasets with R

The second of the weaknesses mentioned earlier was related to the handling of large datasets. Where does this weakness come from? It is actually related to the core of the language: R works in memory. This means that every object created and managed within an R script is stored within your computer's RAM. As a consequence, the total size of your data cannot be greater than the total size of your RAM (and that assumes no other software is consuming your RAM, which is unrealistic). Answers to this problem are actually out of the scope of this book. Nevertheless, we can briefly summarize them into three main strategies:

Optimizing your code, profiling it with packages such as profvis, and applying programming best practices.
Relying on external data storage and wrangling tools, such as Spark, MongoDB, and Hadoop. We will reason a bit more on this in later chapters.
Changing R memory handling behavior, employing packages such as ff, filehash, R.huge, or bigmemory, that try to avoid RAM overloading.
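To get a feeling for the in-memory constraint described above, the following sketch measures how much RAM a plain numeric vector occupies, using the object.size() function shipped with R. The object names are hypothetical, and the exact figures will vary slightly from one machine to another.

```r
# One million double-precision numbers: roughly 8 bytes each
big_vector <- numeric(1e6)
size_in_bytes <- object.size(big_vector)
format(size_in_bytes, units = "MB")

# Ten times more elements need about ten times more RAM
bigger_vector <- numeric(1e7)
as.numeric(object.size(bigger_vector)) / as.numeric(size_in_bytes)

# Removing objects that are no longer needed lets R reclaim their memory
rm(big_vector, bigger_vector)
```

Scaling this reasoning up gives a rough idea of when a dataset will stop fitting in RAM and one of the three strategies becomes necessary.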

The main point I would like to stress here is that even this weakness is actually surmountable. You should bear this in mind when you encounter it for the first time on your R mastery journey.

One final note: as the price of computational power keeps falling, the issue of handling large datasets will become less and less relevant.
