Data Manipulation with R - Second Edition

By Jaynal Abedin, Kishor Kumar Das
About this book

This book starts with the installation of R and how to go about using R and its libraries. We then discuss the mode of R objects and its classes and then highlight different R data types with their basic operations.

The book's primary focus, group-wise data manipulation with the split-apply-combine strategy, is explained with specific examples. The book also covers specific libraries such as lubridate, reshape2, plyr, dplyr, stringr, and sqldf. You will learn not only about group-wise data manipulation, but also how to efficiently handle date, string, and factor variables and how to work with different dataset layouts using the reshape2 package.

By the end of this book, you will have learned about text manipulation using stringr, how to extract data from Twitter using the twitteR library, how to clean raw data, and how to structure your raw data for data mining.

Publication date: March 2015
Publisher: Packt
Pages: 130
ISBN: 9781785288814

 

Chapter 1. Introduction to R Data Types and Basic Operations

R is an object-oriented programming language and environment, a variant of the S language, written by Ross Ihaka and Robert Gentleman (hence the name R). What can we do using R? The answer is that we can do almost anything that can be expressed logically and structurally. With R, we can perform data processing, write functions, produce graphs, perform complex data analysis, and also produce our own customized packages (a collection of functions to perform specified tasks) to solve specific problems. We can develop up-to-date statistical techniques through R packages. Most importantly, R is open source, freely available software that will remain free.

Assuming that readers have very preliminary or no knowledge of R, this chapter is divided into two major sections: the first is an introduction to R, and the second covers data types and basic operations.

The following are the reasons to use R:

  • R is free: It comes with a license, but we do not have to pay anything to get it. It is not only free, but also open source. We can see the source code, change it as per our own requirements, and also distribute it without violating the license. Academicians across different disciplines around the world reviewed the core of the R system and also contributed to make it better.

  • R is a powerful software: It is used to perform data processing and data analysis, and to produce a variety of graphs. All the necessary functions for data processing are available in R. It has a substantial collection of libraries (a library is a collection of functions to perform certain types of task), which are written by researchers working in a variety of fields. That is why, whether you are a statistician, biologist, environmentalist, or data scientist, you should find a set of functions that serves your purpose. The graphic system in R is one of the most powerful tools in this era. We have full control over every part of graphs produced in R.

  • R is up-to-date: R is now one of the standard platforms to implement our research work. We should be able to find an R package suitable for the most recent developments, whatever our field is.

  • R is a community: R is being developed by a team of volunteers. Also, it includes large communities that are writing new functions every day and that can help us out if we face any problem.

  • R is the language of communication: R is now becoming a prominent way of sharing new findings with other researchers in this field.

Here is a summary of why we should use R:

  • R is free, and it will remain free.

  • It involves up-to-date implementation of recent statistical techniques.

  • There is flexibility. The user has control over each and every part of a dataset and each component of each output.

  • It is customizable based on the user's need.

  • It has a large number of built-in libraries.

  • It has a cloud-computing feature.

  • It has rich graphics.

  • It has a wide range of flexible data structures.

  • It intelligently handles missing values.

 

Getting different versions of R


The source code, documentation, and other related files are maintained in the Comprehensive R Archive Network (CRAN), which can be found at http://cran.r-project.org/. CRAN is a collection of websites that contain identical materials consisting of the R distributions, contributed extensions, and documentation for R and binaries. The user can select any one of the CRAN sites to download the R software, choosing the version that is compatible with their computer's platform, such as Windows, Mac, or Linux.

To download binaries for different platforms, anyone can use the following links:

The preceding links are applicable to download the most recent version of R. At the time of writing, the latest version, R 3.1.2 (Pumpkin Helmet), was released on October 31, 2014.

To get an older version of R, Windows users can look at the various releases at http://cran.r-project.org/bin/windows/base/old/, and Mac users can look at http://cran.r-project.org/bin/macosx/old/ to download the desired one.

 

Installing R on different platforms


To install R on various platforms, the first requirement is to download appropriate binaries that are compatible with the relevant platform. In this section, we will briefly discuss installation on the Windows platform and will refer users to http://cran.r-project.org/doc/manuals/r-release/R-admin.html for the documentation for alternative platforms.

Installing R under Windows is as easy as installing any other software. The binary for Windows is an .exe file named, for example, R-3.1.2-win.exe. This executable file contains binaries for a base distribution and a large number of add-on packages from CRAN. Users can install it just by double-clicking on the file and following the on-screen instructions. No special care needs to be taken during installation; just go with the default selections.

 

Installing and using R libraries


R comes with a number of default packages. A package is a collection of previously programmed functions for specific tasks, often bundled with datasets. Such a collection is usually known as a library in other software, but the R community refers to it as a package. There are two types of R packages:

  • Default packages that come with the R executable

  • Add-on packages that do not come with the installation; we need to download and install them ourselves

When we open the R console, it automatically loads its default packages with the associated functions, and we do not need to load those packages manually. A list of installed packages can be obtained by typing library() in the R console. However, other installed packages need to be loaded before their functions can be used. To load a specific package, the corresponding R command is library(package), where package is the name of any library such as plyr, provided that the package has already been installed.

In some situations, we may require a special type of data processing and analysis. If the corresponding packages are not available in the default list, we need to install them. For example, the plyr package is not in the default list, so we need to install it separately.

There are two different ways to install a package:

  • By manually downloading and installing it

  • Installing it from within R

Manually downloading and installing packages

To download a package from CRAN and install it, follow these steps:

  1. Go to http://www.r-project.org/.

  2. Click on CRAN mirror under the Getting Started section.

  3. Select any one of the regional servers from the list; for example, select the server from Austria at http://cran.at.r-project.org/.

  4. Click on Contributed extension packages under the Source Code for all Platforms section.

  5. Select Table of available packages, sorted by date of publication or Table of available packages, sorted by name and then download the desired package from the list.

  6. While downloading, users need to choose the file that matches with the platform; for example, a Windows user will download the binary zip file.

  7. Once the download is completed, open R.

  8. Go to the Packages menu and select Install packages from local zip files.

Tip

One potential problem with manual downloads is that, sometimes, particular packages are dependent on other packages that are not included in the manual process of installation. To avoid this problem, we can install the desired package(s) from the R shell, as installing package(s) from the R shell resolves dependencies.

Installing packages within the R shell

To install a package from within the R console, we can use the install.packages() command; this command will prompt us to select the appropriate CRAN server. Note that to install packages using this approach, the computer must have an active Internet connection.

For example, to install the plyr package, we can use the following command:

install.packages("plyr")

The previous command will prompt us to select a regional server and, after selecting the server from the available list, the package will be installed on the local computer.
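In a script, it can be convenient to install a package only when it is missing. The following is a minimal sketch rather than a recommended standard; the helper name ensure_package is hypothetical, and it relies only on the base functions requireNamespace() and install.packages(). It is demonstrated on stats, which ships with R, so nothing is downloaded:

```r
# Hypothetical helper: install a package only if it is not yet available.
ensure_package <- function(pkg) {
  # requireNamespace() returns FALSE (instead of stopping) when pkg is missing
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # needs an Internet connection and a CRAN mirror
  }
  # Report whether the package can be used now
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" is a default package, so this call installs nothing
stats_ready <- ensure_package("stats")
stats_ready

# For an add-on package, the same pattern would be:
# ensure_package("plyr")
# library(plyr)
```

Wrapping the check in a function keeps the script safe to rerun, because an already installed package is not downloaded again.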

 

Comparing R with other software


The most noticeable feature of R, compared to commercial software such as SAS, Stata, and SPSS and to open source software such as Python and Octave, is its growing number of libraries, currently more than 6,000. This gives R a huge number of tools for data management and statistical analysis. Data management capability is very limited in SPSS and Octave; R's data management capability is comparable only with commercial software such as SAS and open source software such as Python. No competitor matches R in getting the most up-to-date packages for analysis in areas such as finance, mathematics, data mining, machine learning, or even astronomy. Recently developed statistical analysis techniques are also found in Python and Octave, but they take a while to reach SPSS, Stata, and SAS.

R has a more intuitive syntax structure than the previously mentioned software. Its object-oriented features make it more flexible than SPSS, Stata, SAS, and Octave. Python shares the object-oriented features too, but it is less flexible than R. Open source software is designed to be developed by volunteer developers and offers very easy-to-implement function-writing capabilities. Although it is easy to write a function in Python and Octave, writing functions in R is even easier.

R has one of the best graphics systems among all existing software. The grammar of graphics implemented in the ggplot2 package makes it the most popular library for producing a variety of graphs with excellent quality. It is comparatively easy to modify all the components of a graph in R, compared to SPSS, Stata, SAS, Octave, and Python.

SPSS is very easy to use at first for some basic analysis, but when it comes to data management, scripting, and complex statistical analysis, sometimes it fails, and sometimes, it is very hard to implement. Learning Stata is very easy for basic data management tools, but if we want to do a complex data management function, it is very hard to implement. R has a very steep learning curve like Python, Octave, and SAS. However, unlike Octave and SAS, we can find a large number of freely available resources and tutorials on the Web to get help. These resources can make our learning easier compared to other software.

 

R as an enterprise solution


Revolution Analytics (http://www.revolutionanalytics.com/) is a statistical software company focused on developing open core versions of R, for enterprise, academic, and analytics customers. This type of enterprise solution supports big data analytics, various types of complex modeling of real-world problems, and day-to-day activities in big enterprises.

 

Writing commands in R


R is essentially a command-line (interpreted) programming language. We can perform any type of mathematical and statistical calculation, including data management, analysis, and graphics, in the command line. The R command window is known as the R console, where commands are entered and their results are displayed upon execution.

Here is a very basic example of using the R console:

> (44/55)*100
[1] 80
> log(25)
[1] 3.218876
> log10(25)
[1] 1.39794
> exp(0.23)
[1] 1.2586
> 453/365.25
[1] 1.240246
> 1-5*0.2
[1] 0
> 1-0.2-0.2-0.2-0.2-0.2 # An interesting calculation 
[1] 5.551115e-17
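The last result is not exactly zero because 0.2 has no exact binary representation, so each subtraction carries a tiny rounding error. As a brief sketch of how such values can be compared safely, the base function all.equal() tests equality up to a small tolerance, while == tests exact equality:

```r
x <- 1 - 0.2 - 0.2 - 0.2 - 0.2 - 0.2

# Exact comparison: fails because of accumulated rounding error
x == 0

# Comparison with a tolerance; wrap in isTRUE() when used in a condition
isTRUE(all.equal(x, 0))
```

For this reason, conditions on computed decimal values are usually better written with all.equal() than with ==.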

Using the R console, we can perform any type of calculation, but we always need to preserve the code to reproduce the result of any scientific analysis. From this perspective, the R console is not user-friendly when it comes to saving commands. To save the necessary commands for future use and to ensure reproducibility of research results, R has a command editor, which is known as the script editor. The script editor is just like a plain text editor. We can preserve code and comments in R script files. The R console accepts one command at a time and executes it as soon as we press Enter. In a script file, however, we can run a batch of code at a time. To write any type of comment related to any analysis in R, we place a # (hash) sign as the starting character. Here is an example:

# This is a comment line
 

R data types and basic operations


In this major section of the chapter, we will introduce data types and structure and how to convert one type to another with very simple functions.

Modes and classes of R objects

Whatever we do in R is stored as an object. An R object is anything that can be assigned to a variable of interest. This could be a single number or a set of numbers, characters, or special values such as TRUE, FALSE, NA, NaN, and Inf. Objects can also be functions already defined in R, such as seq (to generate a sequence of numbers with a specified increment), names (to extract names such as variable names from a dataset), row.names (to extract the row names of the data, if any), or colnames (to extract column names from a matrix or data frame; for a data frame, this is equivalent to names).

Some examples of R objects are as shown in the following code:

# Constant
> 2
[1] 2
> "July"
[1] "July"
> NULL
NULL
> NA
[1] NA
> NaN
[1] NaN
> Inf
[1] Inf
# Objects can be created from existing objects.
# To make the result reproducible (that is, to get the same results
# every time we run the following code), we need to set a seed value
> set.seed(123)  
> rnorm(9)+runif(9)
[1] -0.2325549  0.7243262  2.4482476  0.7633118  0.7697945  2.7093348  1.1166220 -0.5565308 -0.1427868

One important thing about objects in R is that, if we do not assign an object to any variable, we will not be able to reuse it, and it does not store the object internally. In the preceding example, all are different objects, but they are not assigned to any variable. So, they are not stored, and we cannot use them later, until we enter the object's value itself. Thus, whenever we deal with an object, we will assign it to an appropriate variable; interestingly, the assigned variable is also an object in R!
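As a small sketch of the difference that assignment makes, the following code first prints random numbers without storing them and then stores them in variables (the names first.draw and second.draw are chosen only for this illustration). Resetting the same seed reproduces the same values:

```r
# Without assignment the values are printed and then lost
set.seed(123)
rnorm(3)

# With assignment the values are kept and can be reused
set.seed(123)
first.draw <- rnorm(3)

# Resetting the same seed reproduces the same values
set.seed(123)
second.draw <- rnorm(3)

identical(first.draw, second.draw)
mean(first.draw)   # reuse the stored object
```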

To assign an object in R to a variable, we can define the variable name in various ways, such as lowercase, uppercase, a combination of uppercase and lowercase, or even a combination of letters, numbers, dots, and underscores. However, there are some rules for variable names. For example, a name cannot start with a number or an underscore; it must start with a letter, or with a dot that is not followed by a number. Special characters such as @, #, $, and * are not allowed in variable names. Though R does not have a standard guideline for naming conventions, according to Bååth (in the paper The State of Naming Conventions in R, which can be found at http://journal.r-project.org/archive/2012-2/RJournal_2012-2_Baaaath.pdf), the most popular naming convention for functions is lowerCamelCase, while the most popular naming convention for arguments is period.separated. For a variable name, we can use the same naming convention as that of arguments, but again, there is no strict rule for naming conventions in R.
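One way to check whether a name is syntactically valid is the base function make.names(), which returns its input unchanged when the name is already valid and a repaired version otherwise. The candidate names below are made up for illustration:

```r
candidates <- c("x1", "1x", "_x", ".x", "my var")

# A name is valid if make.names() leaves it unchanged
is.valid <- candidates == make.names(candidates)

data.frame(candidates, repaired = make.names(candidates), is.valid)
```

Here, only x1 and .x pass the check; names that start with a digit or an underscore, or that contain a space, are repaired.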

The following table is constructed from the same article by Bååth to give you an idea of the different naming conventions used in R and their popularity:

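| Object type | Naming convention | Percentage |
|---|---|---|
| Function | lowerCamelCase | 55.5 |
| Function | period.separated | 51.8 |
| Function | underscore_separated | 37.4 |
| Function | singlelowercaseword | 32.2 |
| Function | _OTHER.conventions | 12.8 |
| Function | UpperCamelCase | 6.9 |
| Parameter (argument) | period.separated | 82.8 |
| Parameter (argument) | lowerCamelCase | 75.0 |
| Parameter (argument) | underscore_separated | 70.7 |
| Parameter (argument) | singlelowercaseword | 69.6 |
| Parameter (argument) | _OTHER.conventions | 9.7 |
| Parameter (argument) | UpperCamelCase | 2.4 |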

Once we store the R object into a variable, it is still treated as an R object. Each and every object in R has some attributes to describe the nature of the information contained in it. The mode and class are the most important attributes of an R object. Commonly encountered modes of an individual R object are numeric, character, and logical. When we work with data in R, problems may arise when we apply an operation to an object whose mode does not support it. So, before working with data, we should study the mode to know what type of operation is applicable.

The mode function returns the mode of R objects.

The following example code describes how we can investigate the mode of an R object:

# Storing R object into a variable and then viewing the mode

> num.obj <- seq(from=1,to=10,by=2)
> mode(num.obj)
[1] "numeric"

> logical.obj<-c(TRUE,TRUE,FALSE,TRUE,FALSE)
> mode(logical.obj)
[1] "logical"
> character.obj <- c("a","b","c")
> mode(character.obj)
[1] "character"

For the numeric mode, R stores each numeric object either as 32-bit integers or as double-precision floating-point numbers.
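The two kinds of numeric storage can be told apart with the base function typeof(). The following short sketch shows that a plain number is stored as a double unless integer storage is requested explicitly (the name int.obj is used only for this illustration):

```r
typeof(1)     # a plain number is stored as a double
typeof(1L)    # the L suffix requests integer storage
mode(1L)      # both are reported as mode "numeric"

# Largest value a 32-bit integer can hold
.Machine$integer.max

# Converting a double vector to integer storage
int.obj <- as.integer(c(1, 3, 5))
typeof(int.obj)
```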

If an R object contains both numeric and logical elements, the mode of that object will be numeric and, in this case, the logical element automatically gets converted to a numeric element. The logical element TRUE converts to 1 and FALSE converts to 0. On the other hand, if any R object contains a character element, along with both numeric and logical elements, it automatically converts to the character mode.

Let's have a look at the following code:

# R object containing both numeric and logical element
> xz <- c(1, 3, TRUE, 5, FALSE, 9)
> xz
[1] 1 3 1 5 0 9
> mode(xz)
[1] "numeric"

# R object containing character, numeric, and logical elements
> xw <- c(1,2,TRUE,FALSE,"a")
> xw
[1] "1"     "2"     "TRUE"  "FALSE" "a"    
> mode(xw)
[1] "character"

The mode() function is not the only way to test R object modes. There are alternative ways too: is.numeric(), is.character(), and is.logical(), as shown in the following code. The output of these functions is always logical:

> num.obj <- seq(from=1,to=10,by=2)
> logical.obj<-c(TRUE,TRUE,FALSE,TRUE,FALSE)
> character.obj <- c("a","b","c")

> is.numeric(num.obj)
[1] TRUE
> is.logical(num.obj)
[1] FALSE
> is.character(num.obj)
[1] FALSE

Other than these three modes (numeric, logical, and character) of objects, another frequently encountered mode is function. Here is an example:

> mode(mean)
[1] "function"
# Also we can test whether "mean" is function or not as follows
> is.function(mean)
[1] TRUE

The class() function provides the class information of an R object. The class determines how different functions, including generic functions, treat the object. For example, with the class information, the generic function print or plot knows what to do with a particular R object. To inspect the class of the objects created earlier, we can use the class() function. Let's have a look at the following code:

> num.obj <- seq(from=1,to=10,by=2)
> logical.obj<-c(TRUE,TRUE,FALSE,TRUE,FALSE)
> character.obj <- c("a","b","c")

> class(num.obj)
[1] "numeric"

> class(logical.obj)
[1] "logical"

> class(character.obj)
[1] "character"

Although we can easily assess the mode and class of an R object through mode() and class(), there is another collection of R commands that is also used to assess whether a particular object belongs to a certain class. These functions start with is.; for example, is.numeric(), is.logical(), is.character(), is.list(), is.factor(), and is.data.frame(). As R is an object-oriented programming language, there are many functions (collectively known as generic functions) that will behave differently depending on the class of that particular object.
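To illustrate how a generic function chooses its behavior from the class of its argument, here is a minimal sketch using R's S3 dispatch through UseMethod(). The generic describe and its methods are hypothetical and written only for this example:

```r
# Hypothetical generic: dispatches on the class of its first argument
describe <- function(x, ...) UseMethod("describe")

describe.numeric <- function(x, ...) {
  paste("numeric vector of length", length(x))
}
describe.factor <- function(x, ...) {
  paste("factor with", nlevels(x), "levels")
}
# Fallback used when no class-specific method exists
describe.default <- function(x, ...) {
  paste("object of class", class(x)[1])
}

num.result <- describe(c(1, 3, 5))
fac.result <- describe(factor(c("a", "b", "a")))
chr.result <- describe(c("a", "b"))
c(num.result, fac.result, chr.result)
```

Built-in generics such as print and plot select their methods in the same way.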

The mode of an object tells us how it is stored. Two different objects can be stored in the same mode but have different classes. How each object is printed by the print command is determined by its class. Here is an example:

# Output omitted due to space limitation
> num.obj <- seq(from=1,to=10,by=2)
> set.seed(1234) # To make the matrix reproducible
> mat.obj <- matrix(runif(9),ncol=3,nrow=3)
> mode(num.obj)
> mode(mat.obj)
> class(num.obj)
> class(mat.obj)
# prints a numeric object
> print(num.obj) 
# prints a matrix object
> print(mat.obj)
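The comparison can also be made programmatically. The sketch below uses inherits() rather than comparing class() directly, on the assumption that the code may run on different R versions: class(mat.obj) is "matrix" in older releases and c("matrix", "array") from R 4.0.0 onwards.

```r
num.obj <- seq(from = 1, to = 10, by = 2)
set.seed(1234)
mat.obj <- matrix(runif(9), ncol = 3, nrow = 3)

# Same mode for both objects
mode(num.obj) == mode(mat.obj)

# Different classes
inherits(num.obj, "matrix")
inherits(mat.obj, "matrix")

# The dim attribute is what makes print() lay out rows and columns
dim(num.obj)
dim(mat.obj)
```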

Besides character and numeric, there is another way to store data when the data is categorical in nature. In categorical data, we usually have a few unique values and their corresponding labels. To store this type of object in R, we use the factor class. This class can require less storage, because each unique level is stored only once.

Interestingly, once we try to see the mode of a factor object, it always shows as numeric, even if it displays character data. Here is an example:

> character.obj <- c("a","b","c")
> character.obj
[1] "a" "b" "c"

> is.factor(character.obj)
[1] FALSE

# Converting character object into factor object using as.factor()
> factor.obj <- as.factor(character.obj)
> factor.obj
[1] a b c
Levels: a b c 

> is.factor(factor.obj)
[1] TRUE

> mode(factor.obj)
[1] "numeric"

> class(factor.obj)
[1] "factor"

We have to be careful when dealing with the factor class data in R. The important thing to remember is that, for vectors (we will discuss vectors in the Vector section in this chapter), the class and mode are typically numeric, logical, or character (an integer vector is the exception: its class is integer, while its mode is numeric). On the other hand, for matrices and arrays (we will discuss matrices and arrays later in this chapter), the class is always matrix or array, but the mode can be numeric, character, or logical.
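A common pitfall with factors follows directly from their numeric mode. In the sketch below, a factor holds numbers stored as labels; converting it straight to numeric returns the internal integer codes, not the original values. The object names are chosen only for this illustration:

```r
factor.obj <- factor(c("10", "20", "10", "30"))

# The integer codes that are actually stored
as.integer(factor.obj)

# The labels, each stored only once
levels(factor.obj)

# Pitfall: this returns the codes, not the original numbers
codes <- as.numeric(factor.obj)
codes

# Convert through character to recover the values
values <- as.numeric(as.character(factor.obj))
values
```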

 

The R object structure and mode conversion


When we work with any statistical software, such as R, we rarely use single values for an object. We need to know how we can handle a collection of data values (for example, the ages of 100 randomly selected diabetic patients), along with what type of objects are needed to store these data values. In R, the most convenient way to store more than one data value is a vector, that is, a collection of data values stored in a single object (for example, the ages of 100 diabetic patients). In fact, whenever we create an R object, it stores the values as a vector. It could be a single-element vector or a multiple-element vector. The num.obj vector we created in the previous section is a vector that comprises numeric elements.

One of the simplest ways to create a vector in R is to use the c() function. Here is an example:

# creating vector of numeric element with "c" function
> num.vec <- c(1,3,5,7)
> num.vec
[1] 1 3 5 7
> mode(num.vec)
[1] "numeric"
> class(num.vec)
[1] "numeric"
> is.vector(num.vec)
[1] TRUE

If we create a vector with mixed elements (character and numeric), the resulting vector will be a character vector. Here is an example:

# Vector with mixed elements 
> num.char.vec <- c(1,3,"five",7)
> num.char.vec
[1] "1"    "3"    "five" "7"   
> mode(num.char.vec)
[1] "character"
> class(num.char.vec)
[1] "character"
> is.vector(num.char.vec)
[1] TRUE

We can create a larger vector by combining multiple vectors; the resulting vector's mode will be character if any of the combined vectors contains a character element. A vector can be named or unnamed. In the previous examples, the vectors were unnamed.

The following example combines two vectors and then creates a vector with a name for each element:

# combining multiple vectors
> comb.vec <- c(num.vec,num.char.vec)
> mode(comb.vec)
[1] "character"

# creating named vector
> named.num.vec <- c(x1=1,x2=3,x3=5)
> named.num.vec
x1 x2 x3 
 1  3  5 

The name of the elements in a vector can be assigned separately using the names() command. In R, any single constant is also stored as a vector of the single element.

Here is an example:

# vector of single element
> unit.vec <- 9
> is.vector(unit.vec)
[1] TRUE
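Returning to named vectors, the following sketch shows how names() can assign names after a vector has been created, and how unname() removes them again. The object names are chosen only for this illustration:

```r
num.vec <- c(1, 3, 5)

# Assign names after the vector has been created
names(num.vec) <- c("x1", "x2", "x3")
num.vec

# Elements can then be accessed by name as well as by position
num.vec["x2"]
num.vec[2]

# Remove the names again
plain.vec <- unname(num.vec)
names(plain.vec)
```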

R has six basic storage types of vectors, and each type is known as an atomic vector.

The following table shows the six basic vector types, their mode, and the storage mode:

| Type | Mode | Storage mode |
|---|---|---|
| logical | logical | logical |
| integer | numeric | integer |
| double | numeric | double |
| complex | complex | complex |
| character | character | character |
| raw | raw | raw |
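The table can be reproduced from within R. The following sketch builds one example value per type (the values themselves are arbitrary) and queries it with the base functions typeof(), mode(), and storage.mode():

```r
# One arbitrary example value for each atomic vector type
examples <- list(
  logical   = TRUE,
  integer   = 1L,
  double    = 1.5,
  complex   = 1 + 2i,
  character = "a",
  raw       = as.raw(1)
)

type.table <- data.frame(
  type         = sapply(examples, typeof),
  mode         = sapply(examples, mode),
  storage.mode = sapply(examples, storage.mode)
)
type.table
```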

Other than vectors, there are different storage types available in R to handle data with multiple elements; these are matrix, data frame, and list. We will discuss each of these types in subsequent sections.

To convert the object mode, R has user-friendly functions that can be depicted as as.x. Here, x could be numeric, logical, character, list, data frame, and so on. For example, if we need to perform a matrix operation that requires numeric mode, and the data is stored in some other mode, the operation cannot be performed. In this case, we need to convert that data into numeric mode.

In the following example, we will see how we can convert an object's mode:

# creating a vector of numbers and then converting it to logical
# and character
> numbers.vec <- c(-3,-2,-1,0,1,2,3)
> numbers.vec
[1] -3 -2 -1  0  1  2  3
> num2char <- as.character(numbers.vec)
> num2char
[1] "-3" "-2" "-1" "0"  "1"  "2"  "3"
> num2logical <- as.logical(numbers.vec)
> num2logical
[1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

# creating a character vector and then converting it to numeric and logical
> char.vec <- c("1","3","five","7")
> char.vec
[1] "1"    "3"    "five" "7"   
> char2num <- as.numeric(char.vec)
Warning message:
NAs introduced by coercion 
> char2num
[1]  1  3 NA  7
> char2logical <- as.logical(char.vec)
> char2logical
[1] NA NA NA NA

# logical to character conversion
> logical.vec <- c(TRUE, FALSE, FALSE,  TRUE,  TRUE)
> logical.vec
[1]  TRUE FALSE FALSE  TRUE  TRUE
> logical2char <- as.character(logical.vec)
> logical2char
[1] "TRUE"  "FALSE" "FALSE" "TRUE"  "TRUE"

Note that, when we convert numeric mode to logical mode, only 0 (zero) becomes FALSE, and all other values become TRUE. When we convert a character object to numeric, the elements that look like numbers are converted, and every other element becomes NA with a warning. Converting a character object to logical works only for strings that spell a logical value, such as "TRUE", "true", "T", "FALSE", "false", or "F"; any other string, including "1" and "five" in the preceding example, becomes NA, and no warning is issued. Logical objects, however, are successfully converted to character objects.

In short, numeric and logical vectors can always be converted to character without any warning. However, if we want to convert character objects to any other type, we have to be careful.
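
The following sketch illustrates both points. It first shows which strings as.logical() accepts, and then defines a small helper, to.numeric.checked(), that reports which inputs could not be converted. The helper is hypothetical (it is not part of base R) and is only one possible way to be careful with such conversions.

```r
# Strings that spell a logical value convert; everything else becomes NA
as.logical(c("TRUE", "true", "T", "FALSE", "F", "yes", "1"))

# Hypothetical helper (not part of base R): converts to numeric and
# reports the inputs that could not be parsed as numbers
to.numeric.checked <- function(x) {
  out <- suppressWarnings(as.numeric(x))
  failed <- is.na(out) & !is.na(x)   # NA created by the conversion itself
  if (any(failed)) {
    message("Could not convert: ", paste(x[failed], collapse = ", "))
  }
  out
}

char2num.checked <- to.numeric.checked(c("1", "3", "five", "7"))
char2num.checked
```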

Vector

R is a domain-specific programming language, specially designed to perform statistical analysis on data. In statistics, when we analyze data, the first thing that comes to mind is a variable with hundreds of observations in it, which is naturally pictured as a vector. This is probably the main reason why the most elementary data type in R is the vector. A vector is a contiguous sequence of cells that contain data, where each cell can be accessed by an index:

> age <- c(10,20,30,40)

This is an example of a vector. The ages of four individuals are stored in the age vector. Pay attention to how the vector was formed and stored under the age variable. Here, c() is the function that creates the vector, but on its own it does not store the data anywhere. <- is the assignment operator, which stores the vector under a variable name.

Now, in the console, let's type the following line and press Enter:

> age
[1] 10 20 30 40

We successfully stored all the ages under the age variable, but what is [1]? It is the index of the first value printed on that line; here, it tells us that the index of the value 10 is 1.

If you want to see the third value of the vector, type the following command:

> age[3]
[1] 30

Why does R show only the index of the first value on a line and not the indices of the other values? This keeps the output clean and informative: every time R starts a new line of output, it first prints the index of the next value. Pretty soon, you will be familiar with this convention. We can store a single value under a variable, but it will be a vector with one element:

> height <- 175

To show you that height is not a scalar but a vector with one element, we will store one additional value in it:

> height[2] <- 180

Pay attention to how we added another value to an existing vector. Here, we put 180 in the second cell of the vector. Recall that we read the third cell of the age vector using age[3]; assigning a value to a cell uses the same indexing syntax. Let's put another value inside the height variable:

> height[3] <- 165

Now, we can see all the values stored inside the height variable:

> height
[1] 175 180 165   

Although the vector is the basic data structure in R, there are different types of vectors. We use a numeric vector to store numeric data such as age, height, weight, and so on. Character vectors are used to store string data such as names, addresses, and so on. Defining a character vector in R is simple:

> name <- c("Rob", "Bob", "Jude", "Monica")

When we want to store character data in R, we need to enclose each value in quotes (double quotes, as in the previous example, or single quotes). This tells R that the input is a string. We can also put numeric values inside quotes, in which case they are stored as strings. However, if we type a word without quotes, R treats it as the name of an object and returns an error if no such object exists.

Another special type of vector is the logical vector. A logical vector has two possible values, TRUE and FALSE (a missing value, NA, is also allowed). The formal way to define one is to spell out TRUE and FALSE in full; as a quicker alternative, R also accepts the abbreviations T and F, although these are ordinary variables that can be overwritten, so the full spelling is safer. Logical vectors are used in logical operations in R and can be used to select specific rows from a dataset.

We can define a logical vector in the following way:

> logical <- c(TRUE, FALSE, TRUE, FALSE)

This logical vector can be used to select elements of the age vector in the following way:

> age[logical]
[1] 10 30

Look closely at what we just did. We have already seen how to extract a value from a vector using its index. A logical vector can be thought of as a vector of selection flags, one for each element. The first element of the logical vector is TRUE, so the first element of the age vector is selected. The second element of the logical vector is FALSE, so the second element of the age vector is not selected. In this way, only the elements of the age vector for which the logical vector is TRUE are selected; here, a vector of two elements is returned. You may wonder what we can do with this feature. The answer will be clearer in the Data frame section.
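
In practice, a logical vector is rarely typed by hand; it is usually produced by a comparison. The following sketch reuses the age vector from above; the name older.than.15 is illustrative.

```r
age <- c(10, 20, 30, 40)

# A comparison returns a logical vector of the same length as age
older.than.15 <- age > 15
older.than.15

# Use the logical vector to keep only the matching elements
age[older.than.15]

# Conditions can be combined with & (and) and | (or)
age[age > 15 & age < 40]
```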

 

Factor and its types


A factor is another important data type in R, especially when we deal with categorical variables. A plain R vector can hold any number of distinct values, whereas a factor takes its values only from a limited set of distinct levels. This type of variable is usually referred to as a categorical variable during data analysis and statistical modeling. In statistical modeling, numeric and categorical variables behave differently, so it is important to store the data correctly to ensure valid statistical analysis.

In R, a factor variable stores distinct numeric values internally and uses another character set to display the contents of that variable. In other software, such as Stata, internal numeric values are known as values, and the character set is known as value labels. Previously, we saw that the mode of a factor variable is numeric; this is due to the internal values of the factor variable.

A factor variable can be created using the factor() function; the only required input is a vector of values, which is returned as a vector of factor values. The input can be numeric or character, but the levels of a factor are always stored as character strings. The following example shows how to create factor variables:

#creating factor variable with only one argument with factor() 
> factor1 <- factor(c(1,2,3,4,5,6,7,8,9))
> factor1
[1] 1 2 3 4 5 6 7 8 9
Levels: 1 2 3 4 5 6 7 8 9
> levels(factor1)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9"
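# labels() is a generic function; for a vector it returns the element
# names or, if there are none, the element indices as character strings
# (it does not return the factor levels)
> labels(factor1)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9"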

# creating a factor with user-supplied labels to display
> factor2 <- factor(c(1,2,3,4,5,6,7,8,9),labels=letters[1:9])
> factor2
[1] a b c d e f g h i
Levels: a b c d e f g h i
> levels(factor2)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i"
> labels(factor2)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9"

In a factor variable, the values themselves are stored as a vector of integer codes, whereas the levels (shown by levels()) are stored as character strings, with each unique string stored only once. A factor is ordered if the ordered=TRUE argument is specified; otherwise, it is unordered, and its levels are simply kept in the order in which they were specified.
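
As a brief sketch of an ordered factor, the following uses the levels and ordered arguments of factor(). The severity data is made up for illustration.

```r
# Ordered factor with an explicit level order (illustrative data)
severity <- factor(c("low", "high", "medium", "low"),
                   levels = c("low", "medium", "high"),
                   ordered = TRUE)
severity

# Comparisons respect the level order
below.high <- severity < "high"
below.high

# The internal integer codes follow the level order
as.integer(severity)
```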

A factor could be numeric with numeric levels, but direct mathematical operations are not possible with this numeric factor. Special care should be taken if we want to use mathematical operations.

The following example shows a numeric factor and its mathematical operation:

# creating numeric factor and trying to find out mean
> num.factor <- factor(c(5,7,9,5,6,7,3,5,3,9,7))
> num.factor
[1] 5 7 9 5 6 7 3 5 3 9 7
Levels: 3 5 6 7 9
> mean(num.factor)
[1] NA
Warning message:
In mean.default(num.factor) :
argument is not numeric or logical: returning NA

From the preceding example, we see that we can create a numeric factor, but mathematical operations on it are not possible: mean() returned NA with a warning message. To perform any mathematical operation, we need to convert the factor to its numeric counterpart. One might assume that as.numeric() does this directly, but when applied to a factor, it returns only the internal integer codes, not the desired values.

So, the conversion must be done with levels of that factor variable; optionally, we can first convert the factor into a character using as.character() and then use as.numeric().

The following example describes this scenario:

> num.factor <- factor(c(5,7,9,5,6,7,3,5,3,9,7))
> num.factor
[1] 5 7 9 5 6 7 3 5 3 9 7
Levels: 3 5 6 7 9
#as.numeric() function only returns internal values of the factor
> as.numeric(num.factor)
[1] 2 4 5 2 3 4 1 2 1 5 4
# now see the levels of the factor
> levels(num.factor)
[1] "3" "5" "6" "7" "9"
> as.character(num.factor)
[1] "5" "7" "9" "5" "6" "7" "3" "5" "3" "9" "7"

# now to convert the "num.factor" to numeric there are two methods
# method-1: 
> mean(as.numeric(as.character(num.factor)))
[1] 6

# method-2:
> mean(as.numeric(levels(num.factor)[num.factor]))
[1] 6

Data frame

A data frame is a rectangular arrangement of rows and columns of vectors and/or factors, much like a spreadsheet in MS Excel. The columns represent variables in the data, and the rows represent observations or records. In other software, such as a database package, each column represents a field, and each row represents a record. Working with data rarely means working with a single vector or factor; usually, we deal with a collection of variables. Each column holds only one type of data (numeric, character, logical, or factor), and each row holds the information for one case across all columns. One important thing to remember about R data frames is that all columns must be of the same length. To create a data frame, we can use the data.frame() function.

The following example shows us how to create a data frame using different vectors and factors:

#creating vector of different variables and then creating data frame
> var1 <- c(101,102,103,104,105)
> var2 <- c(25,22,29,34,33)
> var3 <- c("Non-Diabetic", "Diabetic", "Non-Diabetic", "Non-Diabetic", "Diabetic")
> var4 <- factor(c("male","male","female","female","male"))
# now we will create data frame using two numeric vectors one 
# character vector and one factor
> diab.dat <- data.frame(var1,var2,var3,var4)
> diab.dat
  var1 var2         var3   var4
1  101   25 Non-Diabetic   male
2  102   22     Diabetic   male
3  103   29 Non-Diabetic female
4  104   34 Non-Diabetic female
5  105   33     Diabetic   male

Now, if we look at the class of the individual columns of the newly created data frame, we see that the first two columns are numeric and the last two are factor, even though the class of var3 was initially character. In versions of R before 4.0.0, which produced the output shown here, data.frame() converts character columns to factors by default. The stringsAsFactors=FALSE argument prevents this automatic conversion during data frame creation. From R 4.0.0 onward, stringsAsFactors=FALSE is the default, so character columns stay character unless we request otherwise.

In the following example, we will see this:

#class of each column before creating data frame 
> class(var1)
[1] "numeric"
> class(var2)
[1] "numeric"
> class(var3)
[1] "character"
> class(var4)
[1] "factor"
# class of each column after creating data frame
> class(diab.dat$var1)
[1] "numeric"
> class(diab.dat$var2)
[1] "numeric"
> class(diab.dat$var3)
[1] "factor"
> class(diab.dat$var4)
[1] "factor"
# now create the data frame specifying stringsAsFactors=FALSE
> diab.dat.2 <- data.frame(var1,var2,var3,var4,stringsAsFactors=FALSE)
> diab.dat.2
  var1 var2         var3   var4
1  101   25 Non-Diabetic   male
2  102   22     Diabetic   male
3  103   29 Non-Diabetic female
4  104   34 Non-Diabetic female
5  105   33     Diabetic   male

> class(diab.dat.2$var3)
[1] "character"

To access individual columns (variables) from a data frame, we can use a dollar ($) sign along with the data frame name, for example, diab.dat$var1.

There are some other ways to access variables from a data frame, such as the following:

  • The data frame name followed by double square brackets with the variable name within quotation marks, for example, diab.dat[["var1"]]

  • The data frame name followed by single square brackets with the column index, for example, diab.dat[,1]
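
The three access styles return the same vector, which the following sketch checks with identical(). For brevity, it rebuilds a two-column version of diab.dat; the names by.dollar, by.name, and by.index are illustrative.

```r
# A reduced version of diab.dat with the two numeric columns only
diab.dat <- data.frame(var1 = c(101, 102, 103, 104, 105),
                       var2 = c(25, 22, 29, 34, 33))

by.dollar <- diab.dat$var1        # dollar sign
by.name   <- diab.dat[["var1"]]   # double brackets with the name
by.index  <- diab.dat[, 1]        # single brackets with the column index

identical(by.dollar, by.name)
identical(by.dollar, by.index)

# Single brackets without a comma return a one-column data frame instead
class(diab.dat["var1"])
```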

Besides these, there is one other way, which lets us access each variable as if it were a separate R object: the attach() function. After using attach(), we should call detach() to remove the data frame from the search path once we are done, so that its variables are no longer directly accessible.

Let's have a look at the following code:

# To run the following code snippet, var1, var2, var3, and var4
# and the diab.dat.2 data frame from the previous examples
# (code blocks 16 and 17) must already exist.

# The following line removes the var1 to var4
# objects from the workspace
> rm(var1, var2, var3, var4)
# The following command allows us
# to access the individual variables directly
> attach(diab.dat.2)
# Printing the values of var1
> var1
# Checking the class of var3
> class(var3)
# Now detach the data frame
> detach(diab.dat.2)
# Now, if we try to print an individual variable, it gives an error
> var1
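
The Vector section noted that logical indexing becomes more useful with data frames. As a minimal sketch, the following rebuilds diab.dat.2 and selects rows with a logical vector placed before the comma inside the square brackets; the names is.diabetic, diabetic.rows, and older.females are illustrative.

```r
diab.dat.2 <- data.frame(
  var1 = c(101, 102, 103, 104, 105),
  var2 = c(25, 22, 29, 34, 33),
  var3 = c("Non-Diabetic", "Diabetic", "Non-Diabetic", "Non-Diabetic", "Diabetic"),
  var4 = factor(c("male", "male", "female", "female", "male")),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE value per row
is.diabetic <- diab.dat.2$var3 == "Diabetic"

# Keep only the rows where the logical vector is TRUE
diabetic.rows <- diab.dat.2[is.diabetic, ]
diabetic.rows

# Conditions on several columns can be combined
older.females <- diab.dat.2[diab.dat.2$var2 > 25 & diab.dat.2$var4 == "female", ]
older.females
```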

Matrices

A matrix is also a two-dimensional arrangement of data, but all of its elements must be of a single mode. To perform mathematical operations, all columns of a matrix should be numeric. A data frame, in contrast, can hold numeric, character, and factor columns, but certain mathematical operations, such as matrix multiplication, cannot be performed on it directly. For such operations, we use matrix objects. To create a matrix, we can use the matrix() function or convert a numeric data frame to a matrix using as.matrix().

We can convert the data frame that we created earlier as diab.dat to a matrix using as.matrix(). However, because the data frame contains non-numeric columns, the resulting matrix is of character mode and is not suitable for mathematical operations, as shown in the following example:

# data frame to matrix conversion
> mat.diab <- as.matrix(diab.dat)
> mat.diab
     var1  var2 var3           var4    
[1,] "101" "25" "Non-Diabetic" "male"  
[2,] "102" "22" "Diabetic"     "male"  
[3,] "103" "29" "Non-Diabetic" "female"
[4,] "104" "34" "Non-Diabetic" "female"
[5,] "105" "33" "Diabetic"     "male"

> class(mat.diab)
[1] "matrix"
# (in R 4.0.0 and later, class() of a matrix prints "matrix" "array")
> mode(mat.diab)
[1] "character"

# matrix multiplication is not possible with this newly created matrix

> t(mat.diab) %*% mat.diab
Error in t(mat.diab) %*% mat.diab : 
requires numeric/complex matrix/vector arguments

# creating a matrix with numeric elements only
# To produce the same matrix over time we set a seed value
> set.seed(12345) 
> num.mat <- matrix(rnorm(9),nrow=3,ncol=3)
> num.mat
           [,1]       [,2]       [,3]
[1,]  0.5855288 -0.4534972  0.6300986
[2,]  0.7094660  0.6058875 -0.2761841
[3,] -0.1093033 -1.8179560 -0.2841597

> class(num.mat)
[1] "matrix"
> mode(num.mat)
[1] "numeric"

# matrix multiplication
> t(num.mat) %*% num.mat
          [,1]       [,2]       [,3]
[1,] 0.8581332 0.36302951 0.20405722
[2,] 0.3630295 3.87772320 0.06350551
[3,] 0.2040572 0.06350551 0.55404860
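
Converting a data frame works for matrix operations when all of its columns are numeric. The sketch below keeps only the two numeric variables from the earlier example; the names diab.num, num.mat.diab, and cross are illustrative.

```r
# Data frame with numeric columns only (var1 and var2 from the earlier example)
diab.num <- data.frame(var1 = c(101, 102, 103, 104, 105),
                       var2 = c(25, 22, 29, 34, 33))

num.mat.diab <- as.matrix(diab.num)
mode(num.mat.diab)    # "numeric", so matrix operations are possible

# Matrix multiplication now works
cross <- t(num.mat.diab) %*% num.mat.diab
cross
```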

Arrays

An array is a multiply subscripted collection of data entries, all of the same type. Data frames and matrices have two dimensions only, but an array can have any number of dimensions. Sometimes, we need to store several matrices of the same size in a single object; in this case, we can use an array.

Here is a simple example to store three matrices of order 2 x 2 in a single array object:

> mat.array <- array(dim=c(2,2,3))

# To produce the same results over time we set a seed value
> set.seed(12345)

> mat.array[,,1]<-rnorm(4)
> mat.array[,,2]<-rnorm(4)
> mat.array[,,3]<-rnorm(4)

> mat.array
, , 1

          [,1]       [,2]
[1,] 0.5855288 -0.1093033
[2,] 0.7094660 -0.4534972

, , 2

           [,1]       [,2]
[1,]  0.6058875  0.6300986
[2,] -1.8179560 -0.2761841

, , 3

           [,1]       [,2]
[1,] -0.2841597 -0.1162478
[2,] -0.9193220  1.8173120
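
Elements of an array are accessed with one index per dimension. The following sketch builds the same 2 x 2 x 3 array in a single call (array() fills the values in the same column-by-column order as the three assignments above) and then extracts parts of it.

```r
set.seed(12345)
mat.array <- array(rnorm(12), dim = c(2, 2, 3))

dim(mat.array)              # 2 2 3

second.mat <- mat.array[, , 2]   # the second 2 x 2 matrix
second.mat

mat.array[1, 2, 3]          # row 1, column 2 of the third matrix

# Apply a function over the third dimension: one sum per matrix
mat.sums <- apply(mat.array, 3, sum)
mat.sums
```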

List

A list object is a generic R object that can store other objects of any type. In a list object, we can store single constants, vectors of numeric values, factors, data frames, matrices, and even arrays.

Recalling the var1, var2, var3, and var4 vectors, the data frame created using these vectors, and also recalling the array created in the Arrays section, we will create a list object in the following example:

> var1 <- c(101,102,103,104,105)
> var2 <- c(25,22,29,34,33)
> var3 <- c("Non-Diabetic", "Diabetic", "Non-Diabetic", "Non-Diabetic", "Diabetic")
> var4 <- factor(c("male","male","female","female","male"))
> diab.dat <- data.frame(var1,var2,var3,var4)

> mat.array<-array(dim=c(2,2,3))

> set.seed(12345)

> mat.array[,,1]<-rnorm(4)
> mat.array[,,2]<-rnorm(4)
> mat.array[,,3]<-rnorm(4)

# creating list
> obj.list <- list(elem1=var1,elem2=var2,elem3=var3,elem4=var4,elem5=diab.dat,elem6=mat.array) 


> obj.list
$elem1
[1] 101 102 103 104 105

$elem2
[1] 25 22 29 34 33

$elem3
[1] "Non-Diabetic" "Diabetic"     "Non-Diabetic" "Non-Diabetic" "Diabetic"    

$elem4
[1] male   male   female female male  
Levels: female male

$elem5
  var1 var2         var3   var4
1  101   25 Non-Diabetic   male
2  102   22     Diabetic   male
3  103   29 Non-Diabetic female
4  104   34 Non-Diabetic female
5  105   33     Diabetic   male

$elem6
, , 1

          [,1]       [,2]
[1,] 0.5855288 -0.1093033
[2,] 0.7094660 -0.4534972

, , 2

           [,1]       [,2]
[1,]  0.6058875  0.6300986
[2,] -1.8179560 -0.2761841

, , 3

           [,1]       [,2]
[1,] -0.2841597 -0.1162478
[2,] -0.9193220  1.8173120

To access individual elements of a list object, we can use the name of the element (for example, obj.list$elem1 or obj.list[["elem1"]]) or use double square brackets with the index of the element. For example, obj.list[[1]] gives the first element of the newly created list object.
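
The difference between double and single square brackets is worth a short sketch: double brackets return the element itself, whereas single brackets return a shorter list. The list below is a reduced, illustrative version of obj.list.

```r
# A reduced version of obj.list with two elements
obj.list <- list(elem1 = c(101, 102, 103),
                 elem3 = c("Non-Diabetic", "Diabetic"))

obj.list$elem1          # access by name with $
obj.list[["elem1"]]     # access by name with double brackets
obj.list[[1]]           # access by position

class(obj.list[[1]])    # "numeric": the element itself
class(obj.list[1])      # "list": a list containing that element

obj.list[[1]][2]        # second value of the first element
```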

 

Missing values in R


Missing values are part of the data-manipulation process, and we will encounter them in almost every dataset. So, it is important to know how R represents and handles them. In R, a missing value of any type is represented by NA. When printing factors, and character columns of a data frame, R displays a missing value as <NA> to distinguish it from the string "NA". To find the missing values in a dataset (data frame), we can apply is.na() to each column; to check whether any value at all is missing, we can combine is.na() with the any() function.

The following example shows how to check whether any missing value is present in a dataset:

> missing_dat <- data.frame(v1=c(1,NA,0,1),v2=c("M","F",NA,"M"))
> missing_dat
  v1   v2
1  1    M
2 NA    F
3  0 <NA>
4  1    M

> is.na(missing_dat$v1)
[1] FALSE  TRUE FALSE FALSE
> is.na(missing_dat$v2)
[1] FALSE FALSE  TRUE FALSE
> any(is.na(missing_dat))
[1] TRUE
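
Beyond detecting missing values, a few base R functions help to count and handle them. The following is a minimal sketch using the same missing_dat data frame; the names na.counts, complete.rows, and mean.v1 are illustrative.

```r
missing_dat <- data.frame(v1 = c(1, NA, 0, 1), v2 = c("M", "F", NA, "M"))

# Number of missing values in each column
na.counts <- colSums(is.na(missing_dat))
na.counts

# TRUE for rows that have no missing value in any column
complete.rows <- complete.cases(missing_dat)
complete.rows

# Many functions return NA when the input contains NA ...
mean(missing_dat$v1)

# ... unless the missing values are dropped with na.rm = TRUE
mean.v1 <- mean(missing_dat$v1, na.rm = TRUE)
mean.v1
```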
 

Summary


In this chapter, we first talked very briefly about what R is, where and how to get it, and how to install it. We then covered why we should use R and compared it with other available software. After that, we described what R objects are, their modes, and classes. We also highlighted how we can convert modes of objects using R functions, such as as.numeric and as.character. Finally, we discussed different R objects, such as vector, factor, data frame, matrix, array, and list. The chapter ended with an introduction to how missing values are represented and dealt with in R.

In the next chapter, we will discuss data manipulation with different R objects in greater detail.

About the Authors
  • Jaynal Abedin

    Jaynal Abedin currently holds the position of Statistician at the Centre for Communicable Diseases (CCD) at icddr,b ( www.icddrb.org). He attained his Bachelor's and Master's degrees in Statistics from the University of Rajshahi, Rajshahi, Bangladesh. He has vast experience in R programming and Stata and has efficient leadership qualities. He is currently leading a team of statisticians. He has hands-on experience in developing training material and facilitating training in R programming and Stata along with statistical aspects in public health research. His primary area of interest in research includes causal inference and machine learning. He is currently involved in several ongoing public health research projects and is a co-author of several work-in-progress manuscripts. In the useR! Conference 2013, he presented a poster—edeR: Email Data Extraction using R, available at http://www.edii.uclm.es/~useR-2013/abstracts/files/34_edeR_Email_Data_Extraction_using_R.pdf—and obtained the best application poster award. He is also involved in reviewing scientific manuscripts for the Journal of Applied Statistics (JAS) and the Journal of Health Population and Nutrition (JHPN). He is also a successful freelance statistician on online platforms and has an excellent reputation through his high-quality work, especially in R programming. He can be contacted at joystatru@gmail.com, http://bd.linkedin.com/in/jaynal; his Twitter handle is @jaynal83.

  • Kishor Kumar Das

    Kishor Kumar Das is a statistician at the International Centre for Diarrhoeal Disease Research, Bangladesh, an internationally recognized organization that focuses mainly on public health research. He completed his MSc and BSc in applied statistics from the Institute of Statistical Research and Training, University of Dhaka, Bangladesh. He has extensively used R for data processing, statistical analysis, and graphs for more than 10 years. His research interests are survival analysis, machine learning, and statistical computing.
