
R Programming By Example

By Omar Trejo Navarro
About this book
R is a high-level statistical language and is widely used among statisticians and data miners to develop analytical applications. Often, data analysts with great analytical skills lack solid programming knowledge and are unfamiliar with the correct ways to use R. Based on version 3.4, this book will help you develop strong fundamentals when working with R by taking you through a series of full, representative examples, giving you a holistic view of R. We begin with the basic installation and configuration of the R environment. As you progress through the exercises, you'll become thoroughly acquainted with R's features and its packages. With this book, you will learn about the basic concepts of R programming, work efficiently with graphs, create publication-ready and interactive 3D graphs, and gain a better understanding of the data at hand. The detailed step-by-step instructions will enable you to get a clean set of data, produce good visualizations, and create reports for the results. It also teaches you various methods to perform code profiling and performance enhancement with good programming practices, delegation, and parallelization. By the end of this book, you will know how to efficiently work with data, create quality visualizations and reports, and develop code that is modular, expressive, and maintainable.
Publication date: December 2017
Publisher: Packt
Pages: 470
ISBN: 9781788292542

 

Introduction to R

In a world where data is becoming increasingly important, business people and scientists need tools to analyze and process large volumes of data efficiently. R is one of the tools that has become increasingly popular in recent years for data processing, statistical analysis, and data science, and while R has its roots in academia, it is now used by organizations across a wide range of industries and geographical areas.

Some of the important topics covered in this chapter are as follows:

  • History of R and why it was designed the way it was
  • What the interpreter and the console are and how to use them
  • How to work with basic data types and data structures of R
  • How to divide work by using functions in different ways
  • How to introduce complex logic with control structures
 

What R is and what it isn't

When it comes to choosing software for statistical computing, it's tough to argue against R. Who could dislike a high quality, cross-platform, open source, statistical software product? It has an interactive console for exploratory work. It can run as a scripting language to replicate processes. It has a lot of statistical models built in, so you don't have to reinvent the wheel, but when the base toolset is not enough, you have access to a rich ecosystem of external packages. And, it's free! No wonder R has become a favorite in the age of data.

The inspiration for R – the S language

R was inspired by the S statistical language developed by John Chambers at AT&T. The name S is an allusion to another one-letter-name programming language also developed at AT&T, the famous C language. R was created by Ross Ihaka and Robert Gentleman in the Department of Statistics at the University of Auckland in 1991.

The general S philosophy sets the stage for the design of the R language itself, which many programmers coming from other programming languages find somewhat odd and confusing. In particular, it's important to realize that S was developed to make data analysis as easy as possible.

"We wanted users to be able to begin in an interactive environment, where they did not consciously think of programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important."
– John Chambers

The key part here is the transition from analyst to developer. They wanted to build a language that could easily serve both types of users: one suitable for interactive data analysis through a command line, but which could also be used to program complex systems, like traditional programming languages.

It's no coincidence that this book is structured that way. We will start doing data analysis first, and we will gradually move toward developing a full and complex system for information retrieval with a web application on top.

R is a high quality statistical computing system

R is comparable, and often superior, to commercial products when it comes to programming capabilities, complex systems development, graphic production, and community ecosystems. Researchers in statistics and machine learning, as well as many other data-related disciplines, will often publish R packages to accompany their publications. This translates into immediate public access to the very latest statistical techniques and implementations. Whatever model or graphic you're trying to develop, chances are that someone has already tried it, and if not, you can at least learn from their efforts.

R is a flexible programming language

As we have seen, in addition to providing statistical tools, R is a general-purpose programming language. You can use R to extend its own functionality, automate processes that make use of complex systems, and many other things. It incorporates features from other object-oriented programming languages and has strong foundations for functional programming, which is well suited for solving many of the challenges of data analysis. R allows the user to write powerful, concise, and descriptive code.

R is free, as in freedom and as in free beer

In many ways, a language is successful inasmuch as it creates a platform with which many people can create new things, and R has proven to be very successful in this regard. One key limitation of the S language was that it was only available in a commercial package, but R is free software. Free as in freedom, and free as in free beer.

The copyright for the primary source code of R is held by the R Foundation and is published under the General Public License (GPL). According to the Free Software Foundation (http://www.fsf.org/), with free software (free as in freedom) you are granted the following four freedoms:

  • Freedom 0: Run the program for any purpose
  • Freedom 1: Study how the program works and adapt it to your needs
  • Freedom 2: Redistribute copies so you can help your neighbor
  • Freedom 3: Improve the program and release your improvements to the public

These freedoms have allowed R to develop strong prolific communities that include world-class statisticians and programmers as well as many volunteers, who help improve and extend the language. They also allow for R to be developed and maintained for all popular operating systems, and to be easily used by individuals and organizations who wish to do so, possibly sharing their findings in a way that others can replicate their results. Such is the power of free software.

What R is not good for

No programming language or system is perfect. R certainly has a number of drawbacks, the most common being that it can be painfully slow (when not used correctly). Keep in mind that R is essentially based on 40-year-old technology, going back to the original S system developed at Bell Labs. Therefore, several of its imperfections come from the fact that it was not built in anticipation of the data age we live in now. When R was born, disk and RAM were very expensive and the internet was just getting started. Notions of large-scale data analysis and high-performance computing were rare.

Fast-forward to the present: hardware costs are a fraction of what they used to be, computing power is available online for pennies, and everyone is interested in collecting and analyzing data at large scale. This surge in data analysis has brought to the forefront two of R's fundamental limitations, the fact that it's single-threaded and memory-bound. These two characteristics drastically slow it down. Furthermore, R is an interpreted, dynamically typed language, which can make it even slower. And finally, R has object immutability and various ways to implement object-oriented programming, both of which can make it hard for people, especially those coming from other languages, to write high-quality code if they don't know how to deal with them. You should know that all of the characteristics mentioned in this paragraph are addressed in Chapter 9, Implementing an Efficient Simple Moving Average.

A double-edged sword in R is that most of its users do not think of themselves as programmers and are more concerned with results than with process (which is not necessarily a bad thing). This means that much of the R code you can find online is written without regard for elegance, speed, or readability, since most R users do not revise their code to address these shortcomings. This results in code that can be patchy and not rigorously tested, which in turn produces many edge cases that you must take into account when using low-quality packages. You will do well to keep this in mind.

 

Comparing R with other software

My intention for this section is not to provide a comprehensive comparison between R and other software, but to simply point out a few of R's most noticeable features. If you can, I encourage you to test other software yourself so that you know first-hand what may be the best tool for the job at hand.

The most noticeable feature of R compared to other statistical software such as SAS, Stata, SPSS, and even Python, is the very large number of packages that it has available. At the time of writing this, there are almost 12,000 packages published in The Comprehensive R Archive Network (CRAN) (https://cran.r-project.org/), and this does not include packages published in other places, such as Git repositories. This enables R to have a very large community and a huge number of tools for data analysis in areas such as finance, mathematics, machine learning, high-performance computing, and many others.

R has much greater programming capabilities than SAS, Stata, and SPSS, and is even more capable than Python in some respects (for example, in R, you may use different object models). However, efficient and effective R usage requires writing code, which implies a steep learning curve for some people, while Stata and SPSS have graphical user interfaces that guide the user through many of the tasks with point-and-click wizards. In my opinion, this hand-holding, although nice for beginners, quickly becomes an important restriction for people looking to become intermediate to advanced users, and that's where the advantage of programming really shines.

R has one of the best graphics systems among all existing software. The most popular package for producing graphs in R, which we will use extensively in this book, is the ggplot2 package, but there are many other great graphing packages as well. This package allows the modification of virtually every aspect of a graph through its graphics grammar, and is far superior to anything I've seen in SPSS, Stata, SAS, or even Python.

R is a great tool, but it's not the right tool for everything. If you're looking to perform data analysis but don't want to invest the time in learning to program, then software like SAS, Stata, or SPSS may be a better option for you. If you're looking to develop analytical software that is very easily integrated into larger systems and which needs to plug into various interfaces, then Python may be a better tool for the job. However, if you're looking to do a lot of complex data analysis and graphing, and you are going to mostly spend your time focused on these areas, then R is a great choice.

 

The interpreter and the console

As I mentioned earlier, R is an interpreted language. When you enter an expression into the R console or execute an R script in your operating system's terminal, a program called the interpreter parses and executes the code. Other examples of interpreted languages are Lisp, Python, and JavaScript. Unlike C, C++, and Java, R doesn't require you to explicitly compile your programs before you execute them.

All R programs are composed of a series of expressions. The interpreter parses each expression, substitutes objects for symbols where appropriate, evaluates it, and finally returns the resulting object. We will define each of these concepts in the following sections, but you should understand that this is the basic process through which all R programs go.

The R console is the most important tool for using R and can be thought of as a wrapper around the interpreter. The console is a tool that allows you to type expressions directly into R and see how it responds. The interpreter will read the expressions and respond with a result or an error message, if there was one. When you execute expressions through the console, the interpreter will pass objects to the print() function automatically, which is why you can see the result printed below your expressions (we'll cover more on functions later).

If you've used a command line before (for example, bash in Linux or macOS, or cmd.exe in Windows) or a language with an interactive interpreter such as Lisp, Python, or JavaScript, the console should look familiar since it simply is a command-line interface. If not, don't worry. Command-line interfaces are simple-to-use tools. They are programs that receive code and return objects whose printed representations you see below the code you execute.

When you launch R, you will see a window with the R console. Inside the console you will see a message like the one shown below. This message displays some basic information, including the version of R you're running, license information, reminders about how to get help, and a Command Prompt.

Note that the R version in this case is 3.4.2. The code developed in this book assumes this version. If you have a different version and you end up with some problems, the version difference could be something worth looking into.

You should note that, by default, R will display a greater-than sign (>) at the beginning of the last line of the console, signaling you that it's ready to receive commands. Since R is prompting you to type something, this is called a Command Prompt. When you see the greater-than symbol, R is able to receive more expressions as input. When you don't, it is probably because R is busy processing something you sent, and you should wait for it to finish before sending something else.

For the most part, in this book we will avoid showing such command prompts at all, since you may be typing the code into a source code file or directly into the console, but if we do introduce it, make sure that you don't explicitly type it. For example, if you want to replicate the following snippet, you should only type 1 + 2 in your console, and press the Enter key. When you do, you will see a [1] 3 which is the output you received back from R. Go ahead and execute various arithmetic expressions to get a feel for the console:

> 1 + 2
[1] 3
Note the [1] that accompanies each returned value. It's there because the result is actually a vector (an ordered collection). The [1] means that the index of the first item displayed in that row is 1 (in this case, our resulting vector has a single value within).
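
For example, when a vector is too long to fit on one line, the printed output wraps and each new line starts with the index of its first element:

x <- 100:130
x
# The output spans several lines; the first line starts with [1], and each
# following line starts with the index of its first element. The exact point
# at which lines wrap depends on your console width.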

Finally, you should know that the console provides tools for looking through previous commands. You will probably find that the up and down arrow keys are the most useful. You can scroll through previous commands by pressing them. The up arrow lets you look at earlier commands, and the down arrow lets you look at later commands. If you would like to repeat a previous command with a minor change, or if you need to correct a mistake, you can easily do so using these keys.

 

Tools to work efficiently with R

In this section we discuss the tools that will help us when working with R.

Pick an IDE or a powerful editor

For efficient code development, you may want to try a more powerful editor or an Integrated Development Environment (IDE). The most popular IDE for R is RStudio (https://www.rstudio.com/). It offers an impressive feature set that makes interacting with R much easier. If you're new to R, and to programming in general, this is probably the way to go. As you can see in the image below, it wraps the console (right side) within a larger application which offers a lot of functionality, and in this case, it is displaying the help system (left side). Furthermore, RStudio offers tabs to navigate files, browse installed packages, and visualize plots, among other features, as well as a large number of configuration options under the top menu dropdowns.

Throughout this book, we will not use any functionality provided by RStudio. All I will show you is pure R functionality. I decided to proceed this way to make sure that the book is useful for any R programmer, including those who do not use RStudio. For RStudio users, this means that there may be easier ways to accomplish some of the tasks I will show, and instead of programming a few lines, you could simply click some buttons. If that's something you prefer, I encourage you to take a look at the excellent RStudio Essentials webinars, which can be found on RStudio's website at https://www.rstudio.com/resources/webinars/?wvideo=lxel3j2kos, as well as Stanford's Introduction to R, RStudio (https://web.stanford.edu/class/stats101/intro/intro-lab01.html).

You should be careful to avoid the common mistake of referring to R as RStudio. Since many people are introduced to R through RStudio, they think that RStudio is actually R, which it is not. RStudio is a wrapper around R that extends its functionality, and is technically known as an IDE.

Experienced programmers may prefer to work with other tools they already know and love and have used for many years. For example, in my case, I prefer to use Emacs (https://www.gnu.org/software/emacs/) for any programming I do. Emacs is a very powerful text editor that you can programmatically extend to work the way you want it to by using a programming language known as Elisp, which is a dialect of Lisp. In case you use Emacs too, the ess package is all you really need.

If you're going to use Emacs, I encourage you to take a look through the ess package's documentation (https://ess.r-project.org/Manual/ess.html) and Johnson's presentation titled Emacs Has No Learning Curve, University of Kansas, 2015 (http://pj.freefaculty.org/guides/Rcourse/emacs-ess/emacs-ess.pdf). If you use Vim, Sublime Text, Atom, or other similar tools, I'm confident you can find useful packages as well.

The send to console functionality

The base R installation provides the console environment we mentioned in the previous section. This console is really all you need to work with R, but it will quickly become cumbersome to type everything directly into it and it may not be your best option. To efficiently work with R, you need to be able to experiment and iterate as fast as you can. Doing so will accelerate your learning curve and productivity.

Whichever tool you use, the key functionality you need is to be able to easily send code snippets into the console without having to type them yourself, or copying them from your editor and pasting them into the console. In RStudio, you can accomplish this by clicking on the Run or Source button in the top-right corner of the code editor panel. In Emacs, you may use the ess-eval-region command.

The efficient write-execute loop

One of the most productive ways to work with R, especially when learning it, is to use the write-execute loop, which makes use of the send to console functionality mentioned in the previous section. This will allow you to do two very important things: develop your code through small and quick iterations, which allow you to see step-by-step progress until you converge to the behavior you seek, and save the code you converged to as your final result, which can be easily reproduced using the source code file you used for your iterations. R source code files use the .R extension.

Assuming you have a source code file ready to send expressions to the console, the basic steps through the write-execute loop are as follows:

  1. Define what behavior you're looking to implement with code.
  2. Write the minimal amount of code necessary to achieve one piece of the behavior you seek in your implementation.
  3. Use the send to console functionality to verify that the result in the console is what you expected, and if it's not, to identify possible causes.
  4. If it's not what you expected, go back to the second step with the purpose of fixing the code until it has the intended piece of behavior.
  5. If it's what you expected, go back to the second step with the purpose of extending the code with another piece of the behavior, until convergence.

This write-execute loop will become second nature to you as you start using it, and when it does, you'll be a more productive R programmer. It will allow you to diagnose issues faster, to quickly experiment with a few ways of accomplishing the same behavior to find which one seems best for your context, and, once you have working code, it will also allow you to clean your implementation to keep the same behavior with better, more readable code.

For experienced programmers, this should be a familiar process, and it's very similar to Test-Driven Development (TDD), but instead of using unit-tests to automatically test the code, you verify the output in the console in each iteration, and you don't have a set of tests to re-test each iteration. Even though TDD will not be used in this book, you can definitely use it in R.

I encourage you to use this write-execute loop to work through the examples presented in this book. At times, we will show step-by-step progress so that you understand the code better, but it's practically impossible to show all of the write-execute loop iterations I went through to develop it, and much of the knowledge you can acquire comes from iterating this way.

Executing R code in non-interactive sessions

Once your code has the functionality you were looking to implement, executing it through an interactive session using the console may not be the best way to do so. In such cases, another option you have is to tell your computer to directly execute the code for you, in a non-interactive session. This means that you won't be able to type commands into the console, but you'll get the benefit of being able to configure your computer to automatically execute code for you, or to integrate it into larger systems where R is only one of many components. This is known as batch mode.

To execute code in batch mode, you have two options: the old R CMD BATCH command, which we won't look into, and the newer Rscript command, which we will. Rscript is a command that you execute within your computer's terminal. It receives the name of a source code file and executes its contents.

In the following example, we will make use of various concepts that we will explain in later sections, so if you don't feel ready to understand it, feel free to skip it now and come back to it later.

Suppose you have the following code in a file named greeting.R. It gets the arguments passed through the command line to Rscript through the args object created with the commandArgs() function, assigns the corresponding values to the greeting and name variables, and finally prints a vector that contains those values.

args     <- commandArgs(TRUE)
greeting <- args[1]
name     <- args[2]

print(c(greeting, name))

Once ready, you may use the Rscript command to execute it from your Terminal (not from within your R console) as is shown ahead. The result shows the vector with the greeting and name variable values you passed it.

When you see a Command Prompt that begins with the $ symbol instead of the > symbol, it means that you should execute that line in your computer's Terminal, not in the R console.
$ Rscript greeting.R Hi John
[1] "Hi" "John"

Note that if you simply execute the file without any arguments, they will be passed as NA values, which allows you to customize your code to deal with such situations:

$ Rscript greeting.R
[1] NA NA

This was a very simple example, but the same mechanism can be used to execute much more complex systems, like the one we will build in the final chapters of this book to constantly retrieve real-time price data from remote servers.

Finally, if you want to provide a mechanism that is closer to the one in Python, you may want to look into the optparse package to create command-line help pages as well as to parse arguments.
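
As a rough sketch of what that might look like, assuming you have installed the optparse package (for example, with install.packages("optparse")); the file name greeting_opt.R and the option names here are hypothetical:

# greeting_opt.R: a hypothetical variant of greeting.R using optparse
library(optparse)

option_list <- list(
    make_option(c("-g", "--greeting"), type = "character", default = "Hi",
                help = "greeting to use"),
    make_option(c("-n", "--name"), type = "character", default = "there",
                help = "name to greet")
)

opts <- parse_args(OptionParser(option_list = option_list))
print(c(opts$greeting, opts$name))

Executing it from your Terminal would then look like the following, and passing --help prints an automatically generated help page:

$ Rscript greeting_opt.R --greeting Hello --name John
[1] "Hello" "John"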

 

How to use this book

To make the most out of this book, you should recreate on your own the examples shown throughout, and make sure that you understand what each of them is doing in detail. If at some point you feel confused, it's not too difficult to do a couple of searches online to clarify things for yourself. However, I highly recommend that you look into the following books as well, which go into more detail on some of the concepts and ideas presented in this book, and are considered very good references for R programmers:

  • R in a Nutshell, by Adler, O'Reilly, 2010
  • The Art of R Programming, by Matloff, No Starch Press, 2011
  • Advanced R, by Wickham, CRC Press, 2015
  • R Programming for Data Science, by Peng, LeanPub, 2016

Sometimes all you need to do to clarify something is use R's help system. To get help on a function, you may use the question mark notation, like ?function_name, but if you want to search for help on a topic, you may use the help.search() function, like help.search("regression"). This can be helpful if you know what topic you're interested in but can't remember the actual name of the function you want to use. Another way of invoking this functionality is the double question mark notation, like ??regression.
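
As a quick illustration, the following lines show both forms; lm() is used here simply as an example of a function you might look up:

?lm                        # open the help page for the lm() function
help.search("regression")  # search installed help pages for a topic
??regression               # shortcut for help.search("regression")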

Keep in mind that topics in this book are interconnected and not linearly ordered, which means that at times it will seem that we are jumping around. When that happens, it's because a topic can be seen through different points of view. That's why, to make the most out of this book, you should experiment as much as you can in the console and build code progressively using the write-execute loop mentioned earlier. If you simply replicate the code exactly as is shown, you may miss some of the learning that you could have gotten had you built the systems step by step.

Finally, you should know that this book is meant to show how to use R through somewhat real examples, and as such, does not provide too much technical depth or discussion on some of the topics presented. Furthermore, since my objective is to get you quickly working with the real examples, in this first chapter, I explain R fundamentals very briefly, just to introduce the minimum amount of knowledge you need to follow through the real examples presented in the following chapters. Therefore, you should not think that the explanations presented in this chapter are enough for you to understand R's basic constructs. If you're looking for a more in-depth introduction to R fundamentals, you may want to take a look at the references we mentioned previously.

 

Tracking state with symbols and variables

Like most programming languages, R lets you assign values to variables and refer to these objects by name. The names you use to refer to variables are called symbols in R. This allows you to keep some information available in case it's needed at a later point in time. These variables may contain any type of object available in R, even combinations of them when using lists, as we will see in a later section in this chapter. Furthermore, these objects are immutable, but that's a topic for Chapter 9, Implementing an Efficient Simple Moving Average.

In R, the assignment operator is <-, which is a less-than symbol (<) followed by a dash (-). If you have worked with algorithm pseudo code before, you may find it familiar. You may also use the single equals symbol (=) for assignments, similar to many other languages, but I prefer to stick to the <- operator.

An expression like x <- 1 means that the value 1 is assigned to the x symbol, which can be thought of as a variable. You can also assign the other way around, meaning that with an expression like 1 -> x we would have the same effect as we did earlier. However, the assignment from left to right is very rarely used, and is more of a convenience feature in case you forget the assignment operator at the beginning of a line in the console.
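
As a minimal illustration of these three assignment forms:

x <- 1    # standard assignment, used throughout this book
1 -> x    # assignment from left to right, rarely used
x = 1     # also valid, but <- is preferred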

Note that the value substitution is done at the time when the value is assigned to z, not at the time when z is evaluated. If you enter the following code into the console, you can see that the second time z is printed, it still has the value that y had when it was used to assign to it, not the y value assigned afterward:

x <- 1
y <- 2
z <- c(x, y)
z
#> [1] 1 2
y <- 3
z
#> [1] 1 2

It's easy to use variable names like x, y, and z, but using them has a high cost for real programs. When you use names like that, you probably have a very good idea of what values they will contain and how they will be used. In other words, their intention is clear for you. However, when you give your code to someone else or come back to it after a long period of time, those intentions may not be clear, and that's when cryptic names can be harmful. In real programs, your names should be self descriptive and instantly communicate intention.

For a deeper discussion about this and many other topics regarding high-quality code, take a look at Martin's excellent book Clean Code: A Handbook of Agile Software Craftsmanship, Prentice Hall, 2008.
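
As a small, hypothetical illustration of the difference:

d <- 30                       # cryptic: what does d stand for?
delivery_time_in_days <- 30   # self-descriptive: the intention is clear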

Standard object names in R should only contain alphanumeric characters (numbers and ASCII letters), underscores (_), and, depending on context, even periods (.). However, R will allow you to use very cryptic strings if you want. For example, in the following code, we show how the variable name !A @B #C $D %E ^F is used to contain a vector with three numbers. As you can see, you are even allowed to use spaces. You can use this non-standard name provided that you wrap the string with backticks (`):

`!A @B #C $D %E ^F` <- c(1, 2, 3)
`!A @B #C $D %E ^F`
#> [1] 1 2 3

It goes without saying that you should avoid such names, but you should be aware that they exist because they may come in handy when using some of R's more advanced features. These types of variable names are not allowed in most languages, but R is flexible in that way. Furthermore, the example goes to show a common theme in R programming: it is so flexible that, if you're not careful, you will end up shooting yourself in the foot. It's not too rare for someone to be very confused about some code because they assumed R would behave a certain way (for example, raise an error under certain conditions), did not explicitly test for that behavior, and later found that it behaves differently.

 

Working with data types and data structures

This section summarizes the most important data types and data structures in R. In this brief overview, we won't discuss them in depth. We will only show a couple of examples that will allow you to understand the code shown throughout this book. If you want to dig deeper into them, you may look into their documentation or some of the references pointed out in this chapter's introduction.

The basic data types in R are numbers, text, and Boolean values (TRUE or FALSE), which R calls numerics, characters, and logicals, respectively. Strictly speaking, there are also types for integers, complex numbers, and raw data (bytes), but we won't use them explicitly in this book. The basic data structures in R are vectors, factors, matrices, arrays, data frames, and lists; we will summarize the most important of these in the following sections.

Numerics

Numbers in R behave pretty much as you would mathematically expect them to. For example, the operation 2 / 3 performs real division, which results in 0.6666667 in R. This natural numeric behavior is very convenient for data analysis, as you don't need to pay too much attention when using numbers of different types, which may require special handling in other languages. The usual mathematical operator precedence also applies, as does grouping with parentheses.

The following example shows how variables can be used within operations, and how operator priorities are handled. As you can see, you may mix the use of variables with values when performing operations:

x <- 2
y <- 3
z <- 4
(x * y + z) / 5
#> [1] 2

The modulo operation can be performed with the %% symbol, while integer division can be performed with the %/% symbol:

7 %% 3
#> [1] 1
7 %/% 3
#> [1] 2

Special values

There are a few special values in R. The NA values are used to represent missing values, which stands for not available. If a computation results in a number that is too big, R will return Inf for a positive number and -Inf for a negative number, meaning positive and negative infinity, respectively. These are also returned when a number is divided by 0. Sometimes a computation will produce a result that makes little sense. In these cases, we will get a NaN, which stands for not a number. And, finally, there is a null object, represented by NULL. The symbol NULL always points to the same object (which is a data type on its own) and is often used as a default argument in functions to mean that no value was passed through. You should know that NA, Inf, -Inf, NaN, and NULL are not substitutes for each other.

There are specific NA values for numerics, characters, and logicals, but we will stick to the simple NA, which is internally treated as a logical.

In the following example, you can see how these special values behave when used among themselves in R. Note that 1 / 0 results in Inf; 0 / 0, Inf - Inf, and Inf / Inf result in the undefined value NaN; and Inf + Inf, 0 / Inf, and Inf / 0 result in Inf, 0, and Inf, respectively. It's no coincidence that these results resemble mathematical definitions. Also note that any operation that includes NaN or NA will result in NaN or NA, respectively:

1 / 0
#> [1] Inf
-1 / 0
#> [1] -Inf
0 / 0
#> [1] NaN
Inf + Inf
#> [1] Inf
Inf - Inf
#> [1] NaN
Inf / Inf
#> [1] NaN
Inf / 0
#> [1] Inf
0 / Inf
#> [1] 0
Inf / NaN
#> [1] NaN
Inf + NA
#> [1] NA

Characters

Text can be used just as easily; you just need to remember to use quotation marks (" ") around it. The following example shows how to save the text Hi, there! and "10" in two variables. Note that since "10" is surrounded by quotation marks, it is text and not a numeric value. To find out what type of object you're actually dealing with, you can use the class(), typeof(), and str() (short for structure) functions to get the metadata for the object in question.

In this case, since the y variable contains text, we can't multiply it by 2, as is seen in the error we get. Also, if you want to know the number of characters in a string, you can use the nchar() function, as follows:

x <- "Hi, there!"
y <- "10"
class(y)
#> [1] "character"
typeof(y)
#> [1] "character"
str(y)
#> chr "10"
y * 2
#> Error in y * 2: non-numeric argument to binary operator
nchar(x)
#> [1] 10
nchar(y)
#> [1] 2

Sometimes, you may have text information, as well as numeric information that you want to combine into a single string. In this case, you should use the paste() function. This function receives an arbitrary number of unnamed arguments, which is something we will define more precisely in a later section in this chapter. It then transforms each of these arguments into characters, and returns a single string with all of them combined. The following code shows such an example. Note how the numeric value of 10 in y was automatically transformed into a character type so that it could be pasted inside the rest of the string:

x <- "the x variable"
y <- 10
paste("The result for", x, "is", y)
#> [1] "The result for the x variable is 10"

Other times, you will want to replace some characters within a piece of text. In that case, you should use the gsub() function, which stands for global substitution. This function receives the string to be replaced as its first argument, the string to replace with as its second argument, and it will return the text in the third argument with the corresponding replacements:

x <- "The ball is blue"
gsub("blue", "red", x)
#> [1] "The ball is red"

Yet other times, you will want to know whether a string contains a substring, in which case you should use the grepl() function. The name for this function comes from the terminal command known as grep, which is an acronym for global regular expression print (yes, you can also use regular expressions to look for matches). The l at the end of grepl() comes from the fact that the result is a logical:

x <- "The sky is blue"
grepl("blue", x)
#> [1] TRUE
grepl("red", x)
#> [1] FALSE

Logicals

Logical vectors contain Boolean values, which can only be TRUE or FALSE. When you want to create logical variables with such values, you must avoid using quotation marks around them and remember that they are all capital letters, as shown here. When programming in R, logical values are commonly used to test a condition, which is in turn used to decide which branch from a complex program we should take. We will look at examples for this type of behavior in a later section in this chapter:

x <- TRUE

In R, you can easily convert values among different types with the as.*() functions, where * is a wildcard that can be replaced with character, numeric, or logical to convert among these types. These functions receive an object, parse it into the specified type if possible, and return NA if the conversion is not possible. The following example shows how to convert the "TRUE" string into a logical value, which in this case, unsurprisingly, turns out to be the logical TRUE:

as.logical("TRUE")
#> [1] TRUE

Converting from characters and numerics into logicals is one of those things that is not very intuitive in R. The following examples show some of this behavior. Note that even though the true string (all lowercase letters) is not a valid logical value when the quotation marks are removed, it is converted into a TRUE value when applying as.logical() to it, for compatibility reasons. Also note that since T is a valid logical value, which is a shortcut for TRUE, its corresponding text is also accepted as meaning such a value. The same logic applies to false and F. Any other string will return an NA value, meaning that the string could not be parsed as a logical value. Also note that 0 will be parsed as FALSE, but any other numeric value, including Inf, will be converted to a TRUE value. Finally, note that both NA and NaN will be parsed, returning NA in both cases.
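
These conversions can be checked directly in the console:

as.logical("TRUE")    # TRUE
as.logical("true")    # TRUE, accepted for compatibility
as.logical("T")       # TRUE, since T is a shortcut for TRUE
as.logical("FALSE")   # FALSE
as.logical("false")   # FALSE
as.logical("F")       # FALSE
as.logical("blue")    # NA, the string cannot be parsed as a logical
as.logical(0)         # FALSE
as.logical(1)         # TRUE
as.logical(Inf)       # TRUE
as.logical(NA)        # NA
as.logical(NaN)       # NA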

The as.character() and as.numeric() functions have less counter-intuitive behavior, and I will leave you to explore them on your own. When you do, try to test as many edge cases as you can. Doing so will help you foresee possible issues as you develop your own programs.

Before we move on, you should know that these data structures can be organized by their dimensionality and whether they're homogeneous (all contents must be of the same type) or heterogeneous (the contents can be of different types). Vectors, matrices, and arrays are homogeneous data structures, while lists and data frames are heterogeneous. Vectors and lists have a single dimension, matrices and data frames have two dimensions, and arrays can have as many dimensions as we want.

When it comes to dimensions, arrays in R are different from arrays in many other languages, where you would have to create an array of arrays to produce a two-dimensional structure, which is not necessary in R.
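
To make this classification concrete, here is a brief sketch that builds one object of each homogeneous kind plus a list; the array() and list() constructors are shown only for illustration at this point:

v <- c(1, 2, 3)                      # vector: one dimension, homogeneous
m <- matrix(1:6, nrow = 2)           # matrix: two dimensions, homogeneous
a <- array(1:24, dim = c(2, 3, 4))   # array: any number of dimensions, homogeneous
l <- list(1, "a", TRUE)              # list: one dimension, heterogeneous

length(v)
#> [1] 3
dim(m)
#> [1] 2 3
dim(a)
#> [1] 2 3 4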

Vectors

The fundamental data type in R is the vector, which is an ordered collection of values. The first thing you should know is that unlike other languages, single values for numbers, strings, and logicals, are special cases of vectors (vectors of length one), which means that there's no concept of scalars in R. A vector is a one-dimensional data structure and all of its elements are of the same data type.

The simplest way to create a vector is with the c() function, which stands for combine, and coerces all of its arguments into a single type. The coercion will happen from simpler types into more complex types. That is, if we create a vector which contains logicals, numerics, and characters, as the following example shows, our resulting vector will only contain characters, which are the more complex of the three types. If we create a vector that contains logicals and numerics, our resulting vector will be numeric, again because it's the more complex type.

Vectors can be named or unnamed. Unnamed vector elements can only be accessed through positional references, while named vectors can be accessed through positional references as well as name references. In the example below, the y vector is a named vector, where each element is named with a letter from A to I. This means that in the case of x, we can only access elements using their position (the first position is considered as 1 instead of the 0 used in other languages), but in the case of y, we may also use the names we assigned.

Also note how the special values we mentioned before behave: NaN and Inf will be coerced into characters if that's the more complex type, NA stays as NA, and NULL is simply dropped from the vector (which is why x ends up with 10 elements and y with 8). When coercion happens toward numerics, NA, NaN, and Inf all stay the same, since they are valid numeric values. Finally, if we want to know the length of a vector, simply call the length() function on it:

x <- c(TRUE, FALSE, -1, 0, 1, "A", "B", NA, NULL, NaN, Inf)
x
#> [1] "TRUE" "FALSE" "-1" "0" "1" "A" "B" NA
#> [9] "NaN" "Inf"
x[1]
#> [1] "TRUE"
x[5]
#> [1] "1"

y <- c(A = TRUE, B = FALSE, C = -1, D = 0, E = 1, F = NA, G = NULL, H = NaN, I = Inf)
y
#> A B C D E F H I
#> 1 0 -1 0 1 NA NaN Inf
y[1]
#> A
#> 1
y["A"]
#> A
#> 1
y[5]
#> E
#> 1
y["E"]
#> E
#> 1

length(x)
#> [1] 10
length(y)
#> [1] 8

Furthermore, we can select sets or ranges of elements using vectors with index numbers for the values we want to retrieve. For example, using the selector c(1, 2) would retrieve the first two elements of the vector, while using the c(1, 3, 5) would return the first, third, and fifth elements. The : function (yes, it's a function even though we don't normally use the function-like syntax we have seen so far in other examples to call it), is often used as a shortcut to create range selectors. For example, the 1:5 syntax means that we want a vector with elements 1 through 5, which would be equivalent to explicitly using c(1, 2, 3, 4, 5). Furthermore, if we send a vector of logicals, which must have the same length as the vector we want to retrieve values from, each of the logical values will be associated to the corresponding position in the vector we want to retrieve from, and if the corresponding logical is TRUE, the value will be retrieved, but if it's FALSE, it won't be. All of these selection methods are shown in the following example:

x[c(1, 2, 3, 4, 5)]
#> [1] "TRUE" "FALSE" "-1" "0" "1"
x[1:5]
#> [1] "TRUE" "FALSE" "-1" "0" "1"
x[c(1, 3, 5)]
#> [1] "TRUE" "-1" "1"
x[c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE,
    FALSE, TRUE, FALSE, TRUE)]
#> [1] "TRUE" "-1" "1" "B" "NaN" NA

Next we will talk about operations among vectors. In the case of numeric vectors, we can apply operations element-to-element by simply using operators as we normally would. In this case, R will match the elements of the two vectors pairwise and return a vector. The following example shows how two vectors are added, subtracted, multiplied, and divided in an element-to-element way. Furthermore, since we are working with vectors of the same length, we may want to get their dot-product (if you don't know what a dot-product is, you may take a look at https://en.wikipedia.org/wiki/Dot_product), which we can do using the %*% operator, which performs matrix-like multiplications, in this case vector-to-vector:

x <- c(1, 2, 3, 4)
y <- c(5, 6, 7, 8)
x + y
#> [1] 6 8 10 12
x - y
#> [1] -4 -4 -4 -4
x * y
#> [1] 5 12 21 32
x / y
#> [1] 0.2000 0.3333 0.4286 0.5000
x %*% y
#> [,1]
#> [1,] 70

If you want to combine multiple vectors into a single one, you can simply use the c() recursively on them, and it will flatten them for you automatically. Let's say we want to combine the x and y into the z such that the y elements appear first. Furthermore, suppose that after we do we want to sort them, so we apply the sort() function on z:

z <- c(y, x)
z
#> [1] 5 6 7 8 1 2 3 4
sort(z)
#> [1] 1 2 3 4 5 6 7 8

A common source for confusion is how R deals with vectors of different lengths. If we apply an element-to-element operation, as the ones we covered earlier, but using vectors of different lengths, we may expect R to throw an error, as is the case in other languages. However, it does not. Instead, it repeats vector elements in order until they all have the same length. The following example shows three vectors, each of different lengths, and the result of adding them together.

The way R is configured by default, you will actually get a warning message to let you know that the vectors you operated on were not of the same length, but since R can be configured to avoid showing warnings, you should not rely on them:

c(1, 2) + c(3, 4, 5) + c(6, 7, 8, 9)
#> Warning in c(1, 2) + c(3, 4, 5): longer object length is not a multiple of
#> shorter object length
#> Warning in c(1, 2) + c(3, 4, 5) + c(6, 7, 8, 9): longer object length is
#> not a multiple of shorter object length
#> [1] 10 13 14 13

The first thing that may come to mind is that the first vector is expanded into c(1, 2, 1, 2), the second vector is expanded into c(3, 4, 5, 3), and the third one is kept as is, since it's the largest one. Then if we add these vectors together, the result would be c(10, 13, 14, 14). However, as you can see in the example, the result actually is c(10, 13, 14, 13). So, what are we missing? The source of confusion is that R does this step by step, meaning that it will first perform the addition c(1, 2) + c(3, 4, 5), which after being expanded is c(1, 2, 1) + c(3, 4, 5) and results in c(4, 6, 6), then given this result, the next step that R performs is c(4, 6, 6) + c(6, 7, 8, 9), which after being expanded is c(4, 6, 6, 4) + c(6, 7, 8, 9), and that's where the result we get comes from. It can be confusing at first, but just remember to imagine the operations step by step.
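
You can verify this step-by-step reasoning in the console; each partial operation also triggers the recycling warning shown above:

c(1, 2) + c(3, 4, 5)
#> [1] 4 6 6
c(4, 6, 6) + c(6, 7, 8, 9)
#> [1] 10 13 14 13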

Finally, we will briefly mention a very powerful feature in R, known as vectorization. Vectorization means that you apply an operation to a vector at once, instead of independently doing so to each of its elements. This is a feature you should get to know quite well. Programming without it is considered to be bad R code, and not just for syntactic reasons, but because vectorized code takes advantage of many internal optimizations in R, which results in much faster code. We will show different ways of vectorizing code in Chapter 9, Implementing An Efficient Simple Moving Average, and in this chapter, we will see an example, followed by a couple more in following sections.

Even though the phrase vectorized code may seem scary or magical at first, in reality, R makes it quite simple to implement in some cases. For example, we can square each of the elements in the x vector by using the x symbol as if it were a single number. R is intelligent enough to understand that we want to apply the operation to each of the elements in the vector. Many functions in R can be applied using this technique:

x^2
#> [1] 1 4 9 16

We will see more examples that really showcase how vectorization can shine in the following section about functions, where we will see how to apply vectorized operations even when the operations depend on other parameters.
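
To make the contrast concrete, here is a minimal sketch comparing a vectorized computation with an explicit element-by-element loop that produces the same result (loops are part of the control structures covered later in this chapter); the loop version only illustrates the style you will generally want to avoid:

x <- c(1, 2, 3, 4)

# Vectorized: the operation is applied to the whole vector at once
sqrt(x)
#> [1] 1.000000 1.414214 1.732051 2.000000

# Element-by-element loop: same result, but more verbose and usually slower
result <- numeric(length(x))
for (i in seq_along(x)) {
    result[i] <- sqrt(x[i])
}
result
#> [1] 1.000000 1.414214 1.732051 2.000000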

Factors

When analyzing data, it's quite common to encounter categorical values. R provides a good way to represent categorical values using factors, which are created using the factor() function and are integer vectors with associated labels for each integer. The different values that the factor can take are called levels. The levels() function shows all the levels from a factor, and the levels parameter of the factor() function can be used to explicitly define their order, which is alphabetical in case it's not explicitly defined.

Note that defining an explicit order can be important in linear modeling because the first level is used as the baseline level for functions like lm() (linear models), which we will use in Chapter 3, Predicting Votes with Linear Models.

Furthermore, printing a factor shows slightly different information than printing a character vector. In particular, note that the quotes are not shown and that the levels are explicitly printed in order afterwards:

x <- c("Blue", "Red", "Black", "Blue")
y <- factor(c("Blue", "Red", "Black", "Blue"))
z <- factor(c("Blue", "Red", "Black", "Blue"),
            levels = c("Red", "Black", "Blue"))
x
#> [1] "Blue" "Red" "Black" "Blue"
y
#> [1] Blue Red Black Blue
#> Levels: Black Blue Red
z
#> [1] Blue Red Black Blue
#> Levels: Red Black Blue
levels(y)
#> [1] "Black" "Blue" "Red"
levels(z)
#> [1] "Red" "Black" "Blue"

Factors can sometimes be tricky to work with because their types are interpreted differently depending on what function is used to operate on them. Remember the class() and typeof() functions we used before? When used on factors, they may produce unexpected results. As you can see below, the class() function will identify x and y as being character and factor, respectively. However, the typeof() function will let us know that they are character and integer, respectively. Confusing isn't it? This happens because, as we mentioned, factors are stored internally as integers, and use a mechanism similar to look-up tables to retrieve the actual string associated for each one.

Technically, the way factors store the strings associated with their integer values is through attributes, which is a topic we will touch on in Chapter 8, Object-Oriented System to Track Cryptocurrencies.
class(x)
#> [1] "character"
class(y)
#> [1] "factor"
typeof(x)
#> [1] "character"
typeof(y)
#> [1] "integer"

While factors look and often behave like character vectors, as we mentioned, they are actually integer vectors, so be careful when treating them like strings. Some string methods, like gsub() and grepl(), will coerce factors to characters, while others, like nchar(), will throw an error, and still others, like c(), will use the underlying integer values. For this reason, it's usually best to explicitly convert factors to the data type you need:

gsub("Black", "White", x)
#> [1] "Blue" "Red" "White" "Blue"
gsub("Black", "White", y)
#> [1] "Blue" "Red" "White" "Blue"
nchar(x)
#> [1] 4 3 5 4
nchar(y)
#> Error in nchar(y): 'nchar()' requires a character vector
c(x)
#> [1] "Blue" "Red" "Black" "Blue"
c(y)
#> [1] 2 3 1 2

In case you did not notice, nchar() was applied to each of the elements in the x character vector. The "Blue", "Red", and "Black" strings have 4, 3, and 5 characters, respectively. This is another example of the vectorized operations we mentioned in the vectors section earlier.
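
Following the advice above about explicit conversion, here is a minimal sketch of converting a factor with the as.*() functions before operating on it:

as.character(y)
#> [1] "Blue" "Red" "Black" "Blue"
as.integer(y)
#> [1] 2 3 1 2
nchar(as.character(y))    # now nchar() works as expected
#> [1] 4 3 5 4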

Matrices

Matrices are commonly used in mathematics and statistics, and much of R's power comes from the various operations you can perform with them. In R, a matrix is a vector with two additional attributes, the number of rows and the number of columns. And, since matrices are vectors, they are constrained to a single data type.

You can use the matrix() function to create matrices. You may pass it a vector of values, as well as the number of rows and columns the matrix should have. If you specify the vector of values and one of the dimensions, the other one will be calculated for you automatically to be the lowest number that makes sense for the vector you passed. However, you may specify both of them simultaneously if you prefer, which may produce different behavior depending on the vector you passed, as can be seen in the next example.

By default, matrices are constructed column-wise, meaning that the entries can be thought of as starting in the upper-left corner and running down the columns. However, if you prefer to construct it row-wise, you can send the byrow = TRUE parameter. Also, note that you may create an empty or non-initialized matrix, by specifying the number of rows and columns without passing any actual data for its construction, and if you don't specify anything at all, an uninitialized 1-by-1 matrix will be returned. Finally, note that the same element-repetition mechanism we saw for vectors is applied when creating matrices, so do be careful when creating them this way:

matrix()
#> [,1]
#> [1,] NA

matrix(nrow = 2, ncol = 3)
#> [,1] [,2] [,3]
#> [1,] NA NA NA
#> [2,] NA NA NA

matrix(c(1, 2, 3), nrow = 2)
#> Warning in matrix(c(1, 2, 3), nrow = 2): data length [3] is not a sub-
#> multiple or multiple of the number of rows [2]
#> [,1] [,2]
#> [1,] 1 3
#> [2,] 2 1

matrix(c(1, 2, 3), nrow = 2, ncol = 3)
#> [,1] [,2] [,3]
#> [1,] 1 3 2
#> [2,] 2 1 3

matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE)
#> [,1] [,2] [,3]
#> [1,] 1 2 3
#> [2,] 4 5 6

Matrix subsets can be specified in various ways. Using matrix-like notation, you can specify the row and column selections with the same mechanisms we showed before for vectors, that is, with index vectors or logical vectors; if you use a logical vector, it must be of the same length as the matrix dimension you are using it for. Since we now have two dimensions to work with, we separate the row selection and the column selection with a comma (,), with the row selection going first, and R will return their intersection.

For example, x[1, 2] tells R to get the element in the first row and the second column, x[1:2, 2] tells R to get the first and second rows of the second column, and x[c(1, 2), 3] does the same for the third column. You may also use logical vectors for the selection. For example, x[c(TRUE, FALSE), c(TRUE, FALSE, TRUE)] tells R to get the first row while avoiding the second one, and from that row, to get the first and third columns. An equivalent selection is x[1, c(1, 3)]. Note that when you want to specify a single row or column, you can use an integer by itself, but if you want to specify two or more, then you must use vector notation. Finally, if you leave out one of the dimension specifications, R will interpret that as getting all possibilities for that dimension:

x <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)
x[1, 2]
#> [1] 2
x[1:2, 2]
#> [1] 2 5
x[c(1, 2), 3]
#> [1] 3 6
x[c(TRUE, FALSE), c(TRUE, FALSE, TRUE)]
#> [1] 1 3
x[1, c(1, 3)]
#> [1] 1 3
x[, 1]
#> [1] 1 4
x[1, ]
#> [1] 1 2 3

As mentioned earlier, matrices are basic mathematical tools, and R gives you a lot of flexibility when working with them. The most common matrix operations are transposition, which is performed using the t() function, and matrix-vector, vector-matrix, and matrix-matrix multiplication, which are performed with the %*% operator we used previously to calculate the dot product of two vectors.

Note that the same dimensionality restrictions apply as with mathematical notation, meaning that in case you try to perform one of these operations and the dimensions don't make mathematical sense, R will throw an error, as can be seen in the last part of the example:

A <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE)
x <- c(7, 8)
y <- c(9, 10, 11)
A
#> [,1] [,2] [,3]
#> [1,] 1 2 3
#> [2,] 4 5 6
x
#> [1] 7 8
y
#> [1] 9 10 11
t(A)
#> [,1] [,2]
#> [1,] 1 4
#> [2,] 2 5
#> [3,] 3 6
t(x)
#> [,1] [,2]
#> [1,] 7 8
t(y)
#> [,1] [,2] [,3]
#> [1,] 9 10 11
x %*% A
#> [,1] [,2] [,3]
#> [1,] 39 54 69
A %*% t(x)
#> Error in A %*% t(x): non-conformable arguments
A %*% y
#> [,1]
#> [1,] 62
#> [2,] 152
t(y) %*% A
#> Error in t(y) %*% A: non-conformable arguments
A %*% t(A)
#> [,1] [,2]
#> [1,] 14 32
#> [2,] 32 77
t(A) %*% A
#> [,1] [,2] [,3]
#> [1,] 17 22 27
#> [2,] 22 29 36
#> [3,] 27 36 45
A %*% x
#> Error in A %*% x: non-conformable arguments

Lists

A list is an ordered collection of objects, like vectors, but lists can actually combine objects of different types. List elements can contain any type of object that exists in R, including data frames and functions (explained in the following sections). Lists play a central role in R due to their flexibility and they are the basis for data frames, object-oriented programming, and other constructs. Learning to use them properly is a fundamental skill for R programmers, and here, we will barely touch the surface, but you should definitely research them further.

For those familiar with Python, R lists are similar to Python dictionaries.

Lists can be explicitly created using the list() function, which takes an arbitrary number of arguments, and we can refer to each of those elements by both position, and, in case they are specified, also by names. If you want to reference list elements by names, you can use the $ notation.

The following example shows how flexible lists can be. It shows a list that contains numerics, characters, logicals, matrices, and even other lists (these are known as nested lists), and as you can see, we can extract each of those elements to work with them independently.

This is the first time we show a multi-line expression. As you can see, you can split expressions across lines to preserve readability and avoid having very long lines in your code. Arranging code this way is considered good practice. If you're typing this directly in the console, plus symbols (+) will appear at the beginning of each new line, as long as you have an unfinished expression, to guide you along.
x <- list(
A = 1,
B = "A",
C = TRUE,
D = matrix(c(1, 2, 3, 4), nrow = 2),
E = list(F = 2, G = "B", H = FALSE)
)

x
#> $A
#> [1] 1
#>
#> $B
#> [1] "A"
#>
#> $C
#> [1] TRUE
#>
#> $D
#> [,1] [,2]
#> [1,] 1 3
#> [2,] 2 4
#>
#> $E
#> $E$F
#> [1] 2
#>
#> $E$G
#> [1] "B"
#>
#> $E$H
#> [1] FALSE

x[1]
#> $A
#> [1] 1

x$A
#> [1] 1

x[2]
#> $B
#> [1] "A"

x$B
#> [1] "A"

When working with lists, we can use the lapply() function to apply a function to each of the elements in a list. In this case, we want to know the class and type of each of the elements in the list we just created:

lapply(x, class)
#> $A
#> [1] "numeric"
#>
#> $B
#> [1] "character"
#>
#> $C
#> [1] "logical"
#>
#> $D
#> [1] "matrix"
#>
#> $E
#> [1] "list"
lapply(x, typeof)
#> $A
#> [1] "double"
#>
#> $B
#> [1] "character"
#>
#> $C
#> [1] "logical"
#>
#> $D
#> [1] "double"
#>
#> $E
#> [1] "list"

Data frames

Now we turn to data frames, which are a lot like spreadsheets or database tables. In scientific contexts, experiments consist of individual observations (rows), each of which involves several different variables (columns). Often, these variables contain different data types, which would not be possible to store in matrices since they must contain a single data type. A data frame is a natural way to represent such heterogeneous tabular data. Every element within a column must be of the same type, but different elements within a row may be of different types, which is why we say that a data frame is a heterogeneous data structure.

Technically, a data frame is a list whose elements are equal-length vectors, and that's why it permits heterogeneity.

Data frames are usually created by reading in data using the read.table(), read.csv(), or other similar data-loading functions. However, they can also be created explicitly with the data.frame() function, or they can be coerced from other types of objects such as lists. To create a data frame using the data.frame() function, note that we send a vector (which, as we know, must contain elements of a single type) for each of the column names we want our data frame to have, which are A, B, and C in this case. The data frame we create below has four rows (observations) and three variables, with numeric, character, and logical types, respectively. Finally, you can extract subsets of data using the matrix techniques we saw earlier, but you can also reference columns using the $ operator and then extract elements from them:

x <- data.frame(
    A = c(1, 2, 3, 4),
    B = c("D", "E", "F", "G"),
    C = c(TRUE, FALSE, NA, FALSE)
)
x[1, ]
#> A B C
#> 1 1 D TRUE
x[, 1]
#> [1] 1 2 3 4
x[1:2, 1:2]
#> A B
#> 1 1 D
#> 2 2 E
x$B
#> [1] D E F G
#> Levels: D E F G
x$B[2]
#> [1] E
#> Levels: D E F G
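
To verify the earlier note that a data frame is technically a list of equal-length vectors, you can inspect the x data frame directly (a quick check, not part of the original example):

typeof(x)
#> [1] "list"
length(x)    # number of columns, that is, list elements
#> [1] 3
names(x)
#> [1] "A" "B" "C"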

Depending on how the data is organized, the data frame is said to be in either wide or narrow formats (https://en.wikipedia.org/wiki/Wide_and_narrow_data). Finally, if you want to keep only observations for which you have complete cases, meaning only rows that don't contain any NA values for any of the variables, then you should use the complete.cases() function, which returns a logical vector of length equal to the number of rows, and which contains a TRUE value for those rows that don't have any NA values and FALSE for those that have at least one such value.

Note that when we created the x data frame, the C column contains an NA in its third value. If we use the complete.cases() function on x, then we will get a FALSE value for that row and a TRUE value for all others. We can then use this logical vector to subset the data frame just as we have done before with matrices. This can be very useful when analyzing data that may not be clean, and for which you only want to keep those observations for which you have full information:

x
#> A B C
#> 1 1 D TRUE
#> 2 2 E FALSE
#> 3 3 F NA
#> 4 4 G FALSE

complete.cases(x)
#> [1] TRUE TRUE FALSE TRUE
x[complete.cases(x), ]
#> A B C
#> 1 1 D TRUE
#> 2 2 E FALSE
#> 4 4 G FALSE
 

Divide and conquer with functions

Functions are a fundamental building block of R. To master many of the more advanced techniques in this book, you need a solid foundation in how they work. We've already used a few functions above since you can't really do anything interesting in R without them. They are just what you remember from your mathematics classes, a way to transform inputs into outputs. Specifically in R, a function is an object that takes other objects as inputs, called arguments, and returns an output object. Most functions are of the form f(argument_1, argument_2, ...), where f is the name of the function, and argument_1, argument_2, and so on are the arguments to the function.

Before we continue, we should briefly mention the role of curly braces ({}) in R. Often they are used to group a set of operations in the body of a function, but they can also be used in other contexts (as we will see in the case of the web application we will build in Chapter 10, Adding Interactivity with Dashboards). Curly braces are used to evaluate a series of expressions, which are separated by newlines or semicolons, and return only the last expression as a result. For example, the following line only prints the x + y operation to the screen, hiding the output of the x * y operation, which would have been printed had we typed the expressions step by step. In this sense, curly braces are used to encapsulate a set of behavior and only provide the result from the last expression:

{ x <- 1; y <- 2; x * y; x + y }
#> [1] 3

We can create our own function by using the function() constructor and assign it to a symbol. The function() constructor takes an arbitrary number of named arguments, which can be used within the body of the function. Unnamed arguments can also be passed using the "..." argument notation, but that's an advanced technique we won't look at in this book. Feel free to read the documentation for functions to learn more about them.

When calling the function, arguments can be passed by position or by name. The positional order must correspond to the order provided in the function's signature (that is, the function() specification with the corresponding arguments), but when using named arguments, we can send them in whatever order we prefer, as the following example shows.

In the following example, we create a function that calculates the squared Euclidean distance (https://en.wikipedia.org/wiki/Euclidean_distance) between two numeric vectors, that is, the sum of squared differences between corresponding coordinates, and we show how the order of the arguments can be changed if we use named arguments. To see this effect, we use the print() function to make sure we can see in the console what R is receiving as the x and y vectors. When developing your own programs, using the print() function in similar ways is very useful for understanding what's happening.

Instead of using a function name like euclidean_distance, we will use l2_norm because it's the generalized name for such an operation when working with spaces of an arbitrary number of dimensions, and because it will make a follow-up example easier to understand. Note that even though, outside the function call, our vectors are called a and b, since they are passed into the x and y arguments, those are the names we need to use within our function. It would be easy for beginners to confuse these objects as being the same if we had used the x and y names in both places:

l2_norm <- function(x, y) {
    print("x")
    print(x)
    print("y")
    print(y)
    element_to_element_difference <- x - y
    result <- sum(element_to_element_difference^2)
    return(result)
}

a <- c(1, 2, 3)
b <- c(4, 5, 6)

l2_norm(a, b)
#> [1] "x"
#> [1] 1 2 3
#> [1] "y"
#> [1] 4 5 6
#> [1] 27

l2_norm(b, a)
#> [1] "x"
#> [1] 4 5 6
#> [1] "y"
#> [1] 1 2 3
#> [1] 27

l2_norm(x = a, y = b)
#> [1] "x"
#> [1] 1 2 3
#> [1] "y"
#> [1] 4 5 6
#> [1] 27

l2_norm(y = b, x = a)
#> [1] "x"
#> [1] 1 2 3
#> [1] "y"
#> [1] 4 5 6
#> [1] 27

Functions may use the return() function to specify the value returned by the function. However, R will simply return the last evaluated expression as the result of a function, so you may see code that does not make use of the return() function explicitly.
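
As a quick illustration (the add_one name is just a hypothetical helper, not one of the book's examples), the following two definitions behave identically because the last evaluated expression becomes the return value:

add_one <- function(x) {
    return(x + 1)    # explicit return
}
add_one <- function(x) {
    x + 1            # implicit return of the last evaluated expression
}
add_one(1)
#> [1] 2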

Our previous l2_norm() implementation seems somewhat cluttered. If a function consists of a single expression, we can avoid the curly braces altogether, which we can achieve here by removing the print() calls and not creating intermediate objects; since we know the function is working fine, we can do so without hesitation. Furthermore, we avoid explicitly calling the return() function to simplify our code even more. If we do so, our function looks much closer to its mathematical definition and is easier to understand, isn't it?

l2_norm <- function(x, y) sum((x - y)^2)

Furthermore, in case you did not notice, since we use vectorized operations, we can send vectors of any length (dimension), provided that both vectors share the same length, and the function will work just as we expect it to, regardless of the dimensionality of the space we're working with. As I mentioned earlier, vectorization can be quite powerful. In the following example, we show such behavior with vectors of dimension 1 (mathematically known as scalars), as well as vectors of dimension 5, created with the ":" shortcut syntax:

l2_norm(1, 2)
#> [1] 1
l2_norm(1:5, 6:10)
#> [1] 125

Before we move on, I just want to mention that you should always make an effort to follow the Single Responsibility principle, which states that each object (functions in this case) should focus on doing a single thing, and do it very well. Whenever you describe a function you created as doing "something" and "something else," you're probably doing it wrong, since the "and" should let you know that the function is doing more than one thing, and you should split it into two or more functions that possibly call each other. To read more about good software engineering principles, take a look at Robert C. Martin's great book, Agile Software Development: Principles, Patterns, and Practices, Pearson, 2002.

Optional arguments

When creating functions, you may specify a default value for an argument, and if you do, then the argument is considered optional. If you do not specify a default value for an argument, and you do not specify a value when calling a function, you will get an error if the function attempts to use the argument.

In the following example, we show that if a single numeric vector is passed to our l2_norm() function as it stands, it will throw an error, but if we redefine it to make the second vector optional, then it will simply return the first vector's norm instead of the distance between two different vectors. To accomplish this, we will provide a zero-vector of length one as the default, and because R repeats vector elements until all the vectors involved in an operation are of the same length, as we saw before in this chapter, it will automatically expand our zero-vector into the appropriate dimension:

l2_norm(a)     # Should throw an error because `y` is missing
#> Error in l2_norm(a): argument "y" is missing, with no default

l2_norm <- function(x, y = 0) sum((x - y)^2)

l2_norm(a)     # Should work fine, since `y` is optional now
#> [1] 14

l2_norm(a, b)  # Should work just as before
#> [1] 27

As you can see, our function can now optionally receive the y vector, but it will also work as expected without it. Also, note that we introduced some comments into our code. R will ignore anything that comes after the # symbol on a line, which allows us to explain our code where needed. I prefer to avoid using comments because I tend to think that code should be expressive and communicate its intention without them, but they are actually useful every now and then.

Functions as arguments

Sometimes, when you want to generalize functions, you may want to plug a certain functionality into a function. You can do that in various ways. For example, you may use conditionals, as we will see later in this chapter, to provide different functionality based on context. However, conditionals should be avoided when possible because they can introduce unnecessary complexity into our code. A better solution is to pass a function as a parameter that will be called when appropriate, and if we want to change how the receiving function behaves, we can change the function we pass in for a specific task.

That may sound complicated, but in reality, it's very simple. Let's start by creating an l1_norm() function that calculates the distance between two vectors using the sum of absolute differences between corresponding coordinates, instead of the sum of squared differences that our l2_norm() function uses. For more information, take a look at the Taxicab geometry article on Wikipedia (https://en.wikipedia.org/wiki/Taxicab_geometry).

Note that we use the same signature for our two functions, meaning that both receive the same required as well as optional arguments, which are x and y in this case. This is important because if we want to change the behavior by switching functions, we must make sure they are able to work with the same inputs, otherwise, we may get unexpected results or even errors:

l1_norm <- function(x, y = 0) sum(abs(x - y))

l1_norm(a)
#> [1] 6
l1_norm(a, b)
#> [1] 9

Now that our l2_norm() and l1_norm() are built so that they can be switched among themselves to provide different behavior, we will create a third distance() function, which will take the two vectors as arguments, but will also receive a norm argument, which will contain the function we want to use to calculate the distance.

Note that we are specifying that we want to use l2_norm() by default in case there's no explicit selection when calling the function, and to do so we simply specify the symbol that contains the function object, without parentheses. Finally, note that if we want to avoid sending the y vector but want to specify which norm should be used, then we must pass it as a named argument; otherwise, R would interpret the second argument as the y vector, not the norm function:

distance <- function(x, y = 0, norm = l2_norm) norm(x, y)

distance(a)
#> [1] 14
distance(a, b)
#> [1] 27
distance(a, b, l2_norm)
#> [1] 27
distance(a, b, l1_norm)
#> [1] 9
distance(a, norm = l1_norm)
#> [1] 6

Operators are functions

Now that you have a working understanding of how functions work, you should know that not all function calls look like the ones we have shown so far, where you use the name of the function followed by parentheses that contain the function's arguments. Actually, all statements in R, including setting variables and arithmetic operations, are functions in the background, even if we mostly call them with a different syntax.

Remember that previously in this chapter we mentioned that R objects can be referred to by almost any string, but that you should avoid doing so. Well, here we show how such cryptic names can be useful in certain contexts. The following example shows how the assignment, selection, and addition operators are usually used with syntax sugar (a term used to describe syntax that exists for ease of use), but that in the background they use the functions named [<-, [, and +, respectively.

The [<-() function receives three arguments: the vector we want to modify, the position we want to modify in the vector, and the value we want it to have at that position. The [() function receives two arguments, the vector from which we want to retrieve a value and the position of the value we want to retrieve. Finally, the +() function receives the two values we want to add. The following example shows the syntax sugar, followed by the background function calls R performs for us:

x <- c(1, 2, 3, 4, 5)
x
#> [1] 1 2 3 4 5
x[1] <- 10
x
#> [1] 10 2 3 4 5
`[<-`(x, 1, 20)
#> [1] 20 2 3 4 5
x
#> [1] 10 2 3 4 5
x[1]
#> [1] 10
`[`(x, 1)
#> [1] 10
x[1] + x[2]
#> [1] 12
`+`(x[1], x[2])
#> [1] 12
`+`(`[`(x, 1), `[`(x, 1))
#> [1] 20

In practice, you would probably never write these statements as explicit function calls. The syntax sugar is much more intuitive and much easier to read. However, to use some of the advanced techniques shown in this book, it is helpful to know that every operation in R is a function.
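
One practical consequence is that operators can be passed around like any other function. As a small sketch (not from the original text), we can pass the backtick-quoted + function to sapply(), even though x + 10 would of course be the idiomatic, vectorized way to do this:

sapply(x, `+`, 10)    # calls `+`(element, 10) for each element of x
#> [1] 20 12 13 14 15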

Coercion

Finally, we will briefly mention what coercion is in R, since it's a common source of confusion for newcomers. When you call a function with an argument of a different type than what was expected, R will try to coerce values so that the function works, and this can introduce bugs if not handled correctly. R follows a mechanism similar to the one used when creating vectors.

Strongly typed languages (like Java) will raise exceptions when the object passed to a function is of the wrong type, and will not try to convert the object to a compatible type. However, as we mentioned earlier, R was designed to work out of the box in a lot of unforeseen contexts, so coercion was introduced.

In the following example, we show that if we call our distance() function and pass logical vectors instead of numeric ones, R will coerce the logical vectors into numeric vectors, using TRUE as 1 and FALSE as 0, and proceed with the calculations. To avoid this issue in your own programs, you should coerce data types explicitly with the as.*() functions we mentioned before:

x <- c(1, 2, 3)
y <- c(TRUE, FALSE, TRUE)
distance(x, y)
#> [1] 8
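
A minimal sketch of the explicit alternative, using the x and y vectors defined above, would look as follows; the result is the same, but the conversion is now intentional and visible:

distance(x, as.numeric(y))    # convert the logical vector explicitly
#> [1] 8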
 

Complex logic with control structures

The final topic we should cover is how to introduce complex logic by using control structures. When I say introduce complex logic, I don't mean to imply that doing so is complex. Complex logic refers to code that has multiple possible paths of execution, but in reality, it's quite simple to implement.

Nearly every operation in R can be written as a function, and these functions can be passed through to other functions to create very complex behavior. However, it isn't always convenient to implement logic that way and using simple control structures may be a better option sometimes.

The control structures we will look at are if... else conditionals, for loops, and while loops. There are also switch conditionals, which are very much like if... else conditionals, but we won't look at them since we won't use them in the examples contained in this book.

If… else conditionals

As their name states, if…else conditionals will check a condition, and if it is evaluated to be a TRUE value, one path of execution will be taken, but if the condition is evaluated to be a FALSE value, a different path of execution will be taken, and they are mutually exclusive.

To show how if... else conditions work, we will program the same distance() function we used before, but instead of passing it the third argument in the form of a function, we will pass it a string that will be checked to decide which function should be used. This way you can compare different ways of implementing the same functionality. If we pass the l2 string to the norm argument, then the l2_norm() function will be used, but if any other string is passed through, the l1_norm() will be used. Note that we use the double equals operator (==) to check for equality. Don't confuse this with a single equals, which means assignment:

distance <- function(x, y = 0, norm = "l2") {
    if (norm == "l2") {
        return(l2_norm(x, y))
    } else {
        return(l1_norm(x, y))
    }
}

a <- c(1, 2, 3)
b <- c(4, 5, 6)

distance(a, b)
#> [1] 27
distance(a, b, "l2")
#> [1] 27
distance(a, b, "l1")
#> [1] 9
distance(a, b, "l1 will also be used in this case")
#> [1] 9

As can be seen in the last line of the previous example, using conditionals in a non-rigorous manner can introduce potential bugs; in this case, the l1_norm() function was used even though the norm argument in the last function call did not make any sense at all. To avoid such situations, we may introduce more conditionals to exhaust all valid possibilities and throw an error with the stop() function if the else branch is reached, which would mean that no valid option was provided:

distance <- function(x, y = 0, norm = "l2") {
    if (norm == "l2") {
        return(l2_norm(x, y))
    } else if (norm == "l1") {
        return(l1_norm(x, y))
    } else {
        stop("Invalid norm option")
    }
}

distance(a, b, "l1")
#> [1] 9
distance(a, b, "this will produce an error")
#> Error in distance(a, b, "this will produce an error") :
#>   Invalid norm option

Sometimes, there's no need for the else part of the if... else condition. In that case, you can simply avoid putting it in, and R will execute the if branch if the condition is met and will ignore it if it's not.
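
As a minimal sketch (not one of the book's running examples), an if without an else simply does nothing when the condition is FALSE:

x <- 10
if (x > 5) {
    print("x is greater than 5")
}
#> [1] "x is greater than 5"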

There are many different ways to generate the logical values that can be used within the if() check. For example, you could specify an optional argument with a NULL default value and check whether it was not sent in the function call by checking whether the corresponding variable still contains the NULL object at the time of the check, using the is.null() function. The actual condition would look something like if(is.null(optional_argument)). Other times you may get a logical vector, and if a single one of its values is TRUE, then you want to execute a piece of code, in that case you can use something like if(any(logical_vector)) as the condition, or in case you require that all of the values in the logical vector are TRUE to execute a piece of code, then you can use something like if(all(logical_vector)). The same logic can be applied to the self-descriptive functions named is.na() and is.nan().
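
For instance, a small sketch of these checks (the optional_argument and logical_vector names are just illustrative) might look like this:

optional_argument <- NULL
if (is.null(optional_argument)) {
    print("No value was provided")
}
#> [1] "No value was provided"

logical_vector <- c(TRUE, FALSE, TRUE)
any(logical_vector)
#> [1] TRUE
all(logical_vector)
#> [1] FALSE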

Another way to generate these logical values is by using the comparison operators. These include less than (<), less than or equal to (<=), greater than (>), greater than or equal to (>=), exactly equal to (==), which we have already seen, and not equal to (!=). All of these can be used to test numerics as well as characters, in which case alphanumerical order is used. Furthermore, logical values can be combined among themselves to build more complex conditions. For example, the ! operator negates a logical, meaning that !TRUE is equal to FALSE, and !FALSE is equal to TRUE. Other examples are the OR operator (|), where if any of the logical values is TRUE, then the whole expression evaluates to TRUE, and the AND operator (&), where all logicals must be TRUE for the expression to evaluate to TRUE. You will see these used throughout the examples we develop in the rest of the book.
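
Here is a small sketch (not part of the original examples) showing a few of these operators in action:

"A" < "B"        # character comparison uses alphanumerical order
#> [1] TRUE
1 != 2
#> [1] TRUE
!TRUE
#> [1] FALSE
TRUE | FALSE     # OR: TRUE if at least one value is TRUE
#> [1] TRUE
TRUE & FALSE     # AND: TRUE only if all values are TRUE
#> [1] FALSE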

Finally, note that a vectorized form of the if... else conditional is available through the ifelse() function. In the following code, the condition (the first argument) uses the modulo operator to identify which values are even; for those, the TRUE branch (the second argument) labels the integer as even, and for the rest, the FALSE branch (the third argument) labels it as odd:

ifelse(c(1, 2, 3, 4, 5, 6) %% 2 == 0, "even", "odd")
#> [1] "odd" "even" "odd" "even" "odd" "even"

For loops

There are two important properties of for loops. First, results are not printed inside a loop unless you explicitly call the print() function. Second, the indexing variable used within a for loop will be changed, in order, after each iteration. Furthermore, to stop iterating you can use the keyword break, and to skip to the next iteration you can use the next command.
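
As a small sketch of break and next (not one of the book's running examples), the following loop skips the second iteration and stops entirely at the fourth:

for (i in 1:5) {
    if (i == 2) next     # skip the rest of this iteration
    if (i == 4) break    # exit the loop entirely
    print(i)
}
#> [1] 1
#> [1] 3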

For this first example, we create a vector of characters called words and iterate through each of its elements in order using the for (word in words) syntax. Doing so will take the first element in words, assign it to word, and pass it through the expressions in the block delimited by the curly braces, which in this case print the word to the console, as well as the number of characters in the word. When the iteration is finished, word will be updated with the next word, and the loop will be repeated this way until all words have been used:

words <- c("Hello", "there", "dear", "reader")
for (word in words) {
    print(word)
    print(nchar(word))
}
#> [1] "Hello"
#> [1] 5
#> [1] "there"
#> [1] 5
#> [1] "dear"
#> [1] 4
#> [1] "reader"
#> [1] 6

Interesting behavior can be achieved by using nested for loops which are for loops inside other for loops. In this case, the same logic applies, when we encounter a for loop we execute it until completion. It's easier to see the result of such behavior than explaining it, so take a look at the behavior of the following code:

for (i in 1:5) {
    print(i)
    for (j in 1:3) {
        print(paste("   ", j))
    }
}
#> [1] 1
#> [1] " 1"
#> [1] " 2"
#> [1] " 3"
#> [1] 2
#> [1] " 1"
#> [1] " 2"
#> [1] " 3"
#> [1] 3
#> [1] " 1"
#> [1] " 2"
#> [1] " 3"
#> [1] 4
#> [1] " 1"
#> [1] " 2"
#> [1] " 3"
#> [1] 5
#> [1] " 1"
#> [1] " 2"
#> [1] " 3"

Using such nested for loops is how people perform matrix-like operations when using languages that do not offer vectorized operations. Luckily, we can use the syntax shown in previous sections to perform those operations without having to use nested for-loops ourselves which can be tricky at times.

Now, we will see how to use the sapply() and lapply() functions to apply a function to each element of a vector. In this case, we will use the nchar() function on each of the elements in the words vector we created before. The difference between the sapply() and lapply() functions is that the first one returns a vector, while the second returns a list. Finally, note that explicitly using either of these functions is unnecessary here, since, as we have seen before in this chapter, the nchar() function is already vectorized for us:

sapply(words, nchar)
#> Hello there dear reader
#> 5 5 4 6

lapply(words, nchar)
#> [[1]]
#> [1] 5
#>
#> [[2]]
#> [1] 5
#>
#> [[3]]
#> [1] 4
#>
#> [[4]]
#> [1] 6

nchar(words)
#> [1] 5 5 4 6

When you have a function that has not been vectorized, like our distance() function, you can still use it in a vectorized way by making use of the functions we just mentioned. In this case, we will apply it to the x list, which contains three different numeric vectors (note that, since we pass l1_norm directly, this uses the earlier version of distance() that receives the norm as a function rather than as a string). We will use the lapply() function by passing it the list, followed by the function we want to apply to each of its elements (distance() in this case). If the function you are applying receives other arguments apart from the one taken from x, which is passed as its first argument, you can pass them after the function name, like we do here with the c(1, 1, 1) and l1_norm arguments; these will be received by the distance() function as the y and norm arguments, and will remain fixed for all the elements of the x list:

x <- list(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))
lapply(x, distance, c(1, 1, 1), l1_norm)
#> [[1]]
#> [1] 3
#>
#> [[2]]
#> [1] 12
#>
#> [[3]]
#> [1] 21

While loops

Finally, we will take a look at while loops, which loop in a different way than for loops. In the case of for loops, we know the number of elements in the object we iterate over, so we know in advance the number of iterations that will be performed. However, there are times when we don't know this number before we start iterating, and instead, we iterate based on some condition remaining true after each iteration. That's when while loops are useful.

The way while loops work is that we specify a condition, just as with if…else conditions, and if the condition is met, then we proceed to iterate. When the iteration is finished, we check the condition again, and if it continues to be true, then we iterate again, and so on. Note that in this case if we want to stop at some point, we must modify the elements used in the condition such that it evaluates to FALSE at some point. You can also use break and next inside the while loops.

The following example shows how to print all integers from 1 to 10. Note that if we start at 1 as we do, but, instead of adding 1 after each iteration, we subtract 1 or don't change x at all, then we will never stop iterating. That's why you need to be very careful when using while loops, since the number of iterations can be infinite:

x <- 1
while (x <= 10) {
    print(x)
    x <- x + 1
}
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4
#> [1] 5
#> [1] 6
#> [1] 7
#> [1] 8
#> [1] 9
#> [1] 10

In case you do want to execute an infinite loop, you may use a while loop with a TRUE value in place of the condition. If you do not include a break command, the code will effectively run as an infinite loop, repeating itself until stopped with the Ctrl + C keyboard command or any other stopping mechanism in the IDE you're using. However, in such cases, it's cleaner to use the repeat construct, as shown below. It may seem counterintuitive, but there are times when using infinite loops is useful. We will see one such case in Chapter 8, Object-Oriented System to Track Cryptocurrencies; in such cases, there is an external mechanism used to stop the program based on a condition external to R.

Executing the following example will lock your R session in an endless loop, so you will have to interrupt it manually:

# DO NOT EXECUTE THIS, IT'S AN INFINITE LOOP

x <- 1
repeat {
    print(x)
    x <- x + 1
}

#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4
#> [1] 5
#> [1] 6
...
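
For contrast, here is a small sketch (not one of the book's running examples) of a repeat loop that does terminate, because it includes a break:

x <- 1
repeat {
    print(x)
    if (x >= 3) break    # exit once the condition is met
    x <- x + 1
}
#> [1] 1
#> [1] 2
#> [1] 3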
 

The examples in this book

To end this introductory chapter, I want to introduce you to the three examples we will develop throughout the rest of the book. The first one is the Brexit Votes example, in which we are going to use real Brexit votes data, and, with descriptive statistics and linear models, we will attempt to understand the population dynamics at play behind the results. If you're not familiar with Brexit, it is the popular term for the prospective withdrawal of the United Kingdom from the European Union after a referendum which took place on June 23, 2016 (https://en.wikipedia.org/wiki/Brexit). This example will be developed through Chapter 2, Understanding Votes with Descriptive Statistics, and Chapter 3, Predicting Votes with Linear Models.

The second one is The Food Factory example, in which you will learn how to simulate various kinds of data for a hypothetical company called The Food Factory, as well as integrate real data from other sources (customer reviews in this case) to complement our simulations. The data will be used to develop various kinds of visualizations, text analysis, and presentations that are updated automatically. This example will be developed through Chapter 4, Simulating Sales Data and Working with Databases; Chapter 5, Communicating Sales with Visualizations; Chapter 6, Understanding Reviews with Text Analysis; and Chapter 7, Developing Automatic Presentations.

The third and final one is the Cryptocurrencies Tracking System example, in which we will develop an object-oriented system that will be used to retrieve real-time price data from cryptocurrency markets and to track the amount of cryptocurrency assets we hold. We will then show how to compute a simple moving average efficiently using performance optimization techniques, and finally, we will show how to build interactive web applications using only R. This example will be developed through Chapter 8, Object-Oriented System to Track Cryptocurrencies; Chapter 9, Implementing an Efficient Simple Moving Average; and Chapter 10, Adding Interactivity with Dashboards.

 

Summary

In this chapter, we introduced the book by mentioning its intended audience, as well as our intentions for it, which are to provide examples that you can use to understand how real-world R applications are built using high-quality code, along with useful guidelines about what to do and what not to do when building your own applications.

We also introduced R's basic constructs and prepared the baseline we need to work through the examples developed in the rest of the book. Specifically, we looked at how to work with the console, how to create and use variables, how to work with R basic data types like numerics, characters, and logicals, as well as how to handle special values, and how to make basic use of data structures like vectors, factors, matrices, data frames, and lists. Finally, we showed how to create our own functions and how to provide multiple paths of execution with control structures.

I hope this book is useful to you and that you enjoy reading it.

About the Author
  • Omar Trejo Navarro

    Omar Trejo Navarro is a data consultant. He co-founded Datata, is actively working on CVEST, and maintains a personal website (otrenav.com). He is an applied mathematics and economics double major from ITAM in Mexico City, where he continues to work as a research assistant. He does software development with a focus on data platforms, data science, and web applications. He has worked with clients from all over the world, and is a keen supporter of open source, open data, and open science in general.
