Machine Learning with R - Fourth Edition

By Brett Lantz
About this book
Machine learning, at its core, is concerned with transforming data into actionable knowledge. R offers a powerful set of machine learning methods to quickly and easily gain insight from your data. Machine Learning with R, Fourth Edition, provides a hands-on, accessible, and readable guide to applying machine learning to real-world problems. Whether you are an experienced R user or new to the language, Brett Lantz teaches you everything you need to know for data pre-processing, uncovering key insights, making new predictions, and visualizing your findings. This 10th Anniversary Edition features several new chapters that reflect the progress of machine learning in the last few years and help you build your data science skills and tackle more challenging problems, including making successful machine learning models and advanced data preparation, building better learners, and making use of big data. You'll also find this classic R data science book updated to R 4.0.0 with newer and better libraries, advice on ethical and bias issues in machine learning, and an introduction to deep learning. Whether you're looking to take your first steps with R for machine learning or making sure your skills and knowledge are up to date, this is an unmissable read that will help you find powerful new insights in your data.
Publication date: May 2023
Publisher: Packt
Pages: 762
ISBN: 9781801071321

 

Managing and Understanding Data

A key early component of any machine learning project involves managing and understanding data. Although this may not be as gratifying as building and deploying models—the stages in which you begin to see the fruits of your labor—it is unwise to ignore this important preparatory work.

Any learning algorithm is only as good as its training data, and in many cases, this data is complex, messy, and spread across multiple sources and formats. Due to this complexity, often the largest portion of effort invested in machine learning projects is spent on data preparation and exploration.

This chapter approaches data preparation in three ways. The first section discusses the basic data structures R uses to store data. You will become very familiar with these structures as you create and manipulate datasets. The second section is practical, as it covers several functions that are used for getting data in and out of R. In the third section, methods for understanding data are illustrated while exploring a real-world dataset.

By the end of this chapter, you will understand:

  • How to use R’s basic data structures to store and manipulate values
  • Simple functions to get data into R from common source formats
  • Typical methods to understand and visualize complex data

The ways R handles data will dictate the ways you must work with data, so it is helpful to understand R’s data structures before jumping directly into data preparation. However, if you are already familiar with R programming, feel free to skip ahead to the section on data preprocessing.

 

R data structures

There are numerous types of data structures found in programming languages, each with strengths and weaknesses suited to specific tasks. Since R is a programming language used widely for statistical data analysis, the data structures it utilizes were designed with this type of work in mind.

The R data structures used most frequently in machine learning are vectors, factors, lists, arrays, matrices, and data frames. Each is tailored to a specific data management task, which makes it important to understand how they will interact in your R project. In the sections that follow, we will review their similarities and differences.

Vectors

The fundamental R data structure is a vector, which stores an ordered set of values called elements. A vector can contain any number of elements. However, all of a vector’s elements must be of the same type; for instance, a vector cannot contain both numbers and text. To determine the type of vector v, use the typeof(v) command. Note that R is a case-sensitive language, which means that lower-case v and upper-case V could represent two different vectors. This is also true for R’s built-in functions and keywords, so be sure to always use the correct capitalization when typing R commands or expressions.

Several vector types are commonly used in machine learning: integer (numbers without decimals), double (numbers with decimals), character (text data, also commonly called “string” data), and logical (TRUE or FALSE values). Some R functions will report both integer and double vectors as numeric, while others distinguish between the two; generally, this distinction is unimportant. Vectors of logical values are used often in R, but notice that the TRUE and FALSE values must be written in all caps. This is slightly different from some other programming languages.
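
For instance, the distinction can be checked directly at the R command prompt using the typeof() and class() functions:

> typeof(98.6)
[1] "double"
> class(98.6)
[1] "numeric"
> typeof(TRUE)
[1] "logical"
> typeof("fever")
[1] "character"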

There are also two special values that are relevant to all vector types: NA, which indicates a missing value, and NULL, which is used to indicate the absence of any value. Although these two may seem to be synonymous, they are indeed slightly different. The NA value is a placeholder for something else and therefore has a length of one, while the NULL value is truly empty and has a length of zero.
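
A quick check with the length() function confirms this difference:

> length(NA)
[1] 1
> length(NULL)
[1] 0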

It is tedious to enter large amounts of data by hand, but simple vectors can be created by using the c() combine function. The vector can also be given a name using the arrow <- operator. This is R’s assignment operator, used much like the = assignment operator used in many other programming languages.

R also allows the use of the = operator for assignment, but it is considered a poor coding style according to commonly accepted style guidelines.

For example, let’s construct a set of vectors containing data on three medical patients. We’ll create a character vector named subject_name to store the three patient names, a numeric vector named temperature to store each patient’s body temperature in degrees Fahrenheit, and a logical vector named flu_status to store each patient’s diagnosis (TRUE if they have influenza, FALSE otherwise). As shown in the following code, the three vectors are:

> subject_name <- c("John Doe", "Jane Doe", "Steve Graves")
> temperature <- c(98.1, 98.6, 101.4)
> flu_status <- c(FALSE, FALSE, TRUE)

Values stored in R vectors retain their order. Therefore, data for each patient can be accessed using their position in the set, beginning at 1, then supplying this number inside square brackets (that is, [ and ]) following the name of the vector. For instance, to obtain the temperature value for patient Jane Doe, the second patient, simply type:

> temperature[2]
[1] 98.6

R offers a variety of methods to extract data from vectors. A range of values can be obtained using the colon operator. For instance, to obtain the body temperature of the second and third patients, type:

> temperature[2:3]
[1] 98.6 101.4

Items can be excluded by specifying a negative item number. To exclude the second patient’s temperature data, type:

> temperature[-2]
[1]  98.1 101.4

It is also sometimes useful to specify a logical vector indicating whether each item should be included. For example, to include the first two temperature readings but exclude the third, type:

> temperature[c(TRUE, TRUE, FALSE)]
[1] 98.1 98.6

The importance of this type of operation is clearer with the realization that the result of a logical expression like temperature > 100 is a logical vector. This expression returns TRUE or FALSE depending on whether the temperature is greater than 100 degrees Fahrenheit, which indicates a fever. Therefore, the following commands will identify the patients exhibiting a fever:

> fever <- temperature > 100
> subject_name[fever]
[1] "Steve Graves"

Alternatively, the logical expression can also be moved inside the brackets, which returns the same result in a single step:

> subject_name[temperature > 100]
[1] "Steve Graves"

As you will see shortly, the vector provides the foundation for many other R data structures and can be combined with programming expressions to complete more complex operations for selecting data and constructing new features. Therefore, knowing the various vector operations is crucial for working with data in R.
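
For example, short expressions built from the vectors created earlier can summarize the patient data, such as counting the patients with a fever or computing the average body temperature:

> sum(temperature > 100)
[1] 1
> mean(temperature)
[1] 99.36667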

Factors

Recall from Chapter 1, Introducing Machine Learning, that nominal features represent a characteristic with categories of values. Although it is possible to use a character vector to store nominal data, R provides a data structure specifically for this task.

A factor is a special type of vector that is solely used for representing categorical or ordinal data. In the medical dataset we are building, we might use a factor to represent the patients’ biological sex and record two categories: male and female.

Why use factors rather than character vectors? One advantage of factors is that the category labels are stored only once. Rather than storing MALE, MALE, FEMALE, the computer may store 1, 1, 2, which can reduce the memory needed to store the values. Additionally, many machine learning algorithms handle nominal and numeric features differently. Coding categorical features as factors allows R to treat the categorical features appropriately.

A factor should not be used for character vectors with values that don’t truly fall into categories. If a vector stores mostly unique values such as names or identification codes like social security numbers, keep it as a character vector.

To create a factor from a character vector, simply apply the factor() function. For example:

> gender <- factor(c("MALE", "FEMALE", "MALE"))
> gender
[1] MALE   FEMALE MALE
Levels: FEMALE MALE

Notice that when the gender factor was displayed, R printed additional information about its levels. The levels comprise the set of possible categories the factor could take, in this case, MALE or FEMALE.

When we create factors, we can add additional levels that may not appear in the original data. Suppose we created another factor for blood type, as shown in the following example:

> blood <- factor(c("O", "AB", "A"),
            levels = c("A", "B", "AB", "O"))
> blood
[1] O  AB A
Levels: A B AB O

When we defined the blood factor, we specified an additional vector of four possible blood types using the levels parameter. As a result, even though our data includes only blood types O, AB, and A, all four types are retained with the blood factor, as the output shows. Storing the additional level allows for the possibility of adding patients with the other blood type in the future. It also ensures that if we were to create a table of blood types, we would know that type B exists, despite it not being found in our initial data.
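
For instance, tabulating the factor with the table() function reports a count of zero for type B rather than omitting the category entirely:

> table(blood)
blood
 A  B AB  O 
 1  0  1  1 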

The factor data structure also allows us to include information about the order of a nominal feature’s categories, which provides a method for creating ordinal features. For example, suppose we have data on the severity of patient symptoms, coded in increasing order of severity from mild, to moderate, to severe. We indicate the presence of ordinal data by providing the factor’s levels in the desired order, listed ascending from lowest to highest, and setting the ordered parameter to TRUE as shown:

> symptoms <- factor(c("SEVERE", "MILD", "MODERATE"),
               levels = c("MILD", "MODERATE", "SEVERE"),
               ordered = TRUE)

The resulting symptoms factor now includes information about the requested order. Unlike our prior factors, the levels of this factor are separated by < symbols to indicate the presence of a sequential order from MILD to SEVERE:

> symptoms
[1] SEVERE   MILD     MODERATE
Levels: MILD < MODERATE < SEVERE

A helpful feature of ordered factors is that logical tests work as you would expect. For instance, we can test whether each patient’s symptoms are more severe than moderate:

> symptoms > "MODERATE"
[1]  TRUE FALSE FALSE

Machine learning algorithms capable of modeling ordinal data will expect ordered factors, so be sure to code your data accordingly.

Lists

A list is a data structure, much like a vector, in that it is used for storing an ordered set of elements. However, where a vector requires all its elements to be the same type, a list allows different R data types to be collected. Due to this flexibility, lists are often used to store various types of input and output data and sets of configuration parameters for machine learning models.

To illustrate lists, consider the medical patient dataset we have been constructing, with data for three patients stored in six vectors. If we wanted to display all the data for the first patient, we would need to enter six R commands:

> subject_name[1]
[1] "John Doe"
> temperature[1]
[1] 98.1
> flu_status[1]
[1] FALSE
> gender[1]
[1] MALE
Levels: FEMALE MALE
> blood[1]
[1] O
Levels: A B AB O
> symptoms[1]
[1] SEVERE
Levels: MILD < MODERATE < SEVERE

If we expect to examine the patient’s data again in the future, rather than retyping these commands, a list allows us to group all the values into one object we can use repeatedly.

Similar to creating a vector with c(), a list is created using the list() function, as shown in the following example. One notable difference is that when a list is constructed, each component in the sequence should be given a name. The names are not strictly required, but allow the values to be accessed later by name rather than by numbered position and a mess of square brackets. To create a list with named components for the first patient’s values, type the following:

> subject1 <- list(fullname = subject_name[1],
                   temperature = temperature[1],
                   flu_status = flu_status[1],
                   gender = gender[1],
                   blood = blood[1],
                   symptoms = symptoms[1])

This patient’s data is now collected in the subject1 list:

> subject1
$fullname
[1] "John Doe"
$temperature
[1] 98.1
$flu_status
[1] FALSE
$gender
[1] MALE
Levels: FEMALE MALE
$blood
[1] O
Levels: A B AB O
$symptoms
[1] SEVERE
Levels: MILD < MODERATE < SEVERE

Note that the values are labeled with the names we specified in the preceding command. As a list retains order like a vector, its components can be accessed using numeric positions, as shown here for the temperature value:

> subject1[2]
$temperature
[1] 98.1

The result of using vector-style operators on a list object is another list object, which is a subset of the original list. For example, the preceding code returned a list with a single temperature component. To instead return a single list item in its native data type, use double brackets ([[ and ]]) when selecting the list component. For example, the following command returns a numeric vector of length 1:

> subject1[[2]]
[1] 98.1

For clarity, it is often better to access list components by name, by appending a $ and the component name to the list name as follows:

> subject1$temperature
[1] 98.1

Like the double-bracket notation, this returns the list component in its native data type (in this case, a numeric vector of length 1).

Accessing the value by name also ensures that the correct item is retrieved even if the order of the list elements is changed later.

It is possible to obtain several list items by specifying a vector of names. The following returns a subset of the subject1 list, which contains only the temperature and flu_status components:

> subject1[c("temperature", "flu_status")]
$temperature
[1] 98.1
$flu_status
[1] FALSE

Entire datasets could be constructed using lists, and lists of lists. For example, you might consider creating a subject2 and subject3 list and grouping these into a list object named pt_data. However, constructing a dataset in this way is common enough that R provides a specialized data structure specifically for this task.
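
As a rough sketch, assuming hypothetical subject2 and subject3 lists were built just as subject1 was, the grouping might look like the following, where pt_list is a name chosen purely for illustration:

> pt_list <- list(subject1, subject2, subject3)
> pt_list[[1]]$temperature
[1] 98.1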

Data frames

By far the most important R data structure for machine learning is the data frame, a structure analogous to a spreadsheet or database in that it has both rows and columns of data. In R terms, a data frame can be understood as a list of vectors or factors, each having exactly the same number of values. Because the data frame is literally a list of vector-type objects, it combines aspects of both vectors and lists.

Let’s create a data frame for our patient dataset. Using the patient data vectors we created previously, the data.frame() function combines them into a data frame:

> pt_data <- data.frame(subject_name, temperature, 
                        flu_status, gender, blood, symptoms)

When displaying the pt_data data frame, we see that the structure is quite different from the data structures we’ve worked with previously:

> pt_data
  subject_name temperature flu_status gender blood symptoms
1     John Doe        98.1      FALSE   MALE     O   SEVERE
2     Jane Doe        98.6      FALSE FEMALE    AB     MILD
3 Steve Graves       101.4       TRUE   MALE     A MODERATE

Compared to one-dimensional vectors, factors, and lists, a data frame has two dimensions and is displayed in a tabular format. Our data frame has one row for each patient and one column for each vector of patient measurements. In machine learning terms, the data frame’s rows are the examples, and the columns are the features or attributes.

To extract entire columns (vectors) of data, we can take advantage of the fact that a data frame is simply a list of vectors. Like lists, the most direct way to extract a single element is by referring to it by name. For example, to obtain the subject_name vector, type:

> pt_data$subject_name
[1] "John Doe"     "Jane Doe"     "Steve Graves"

Like lists, a vector of names can be used to extract multiple columns from a data frame:

> pt_data[c("temperature", "flu_status")]
  temperature flu_status
1        98.1      FALSE
2        98.6      FALSE
3       101.4       TRUE

When we request data frame columns by name, the result is a data frame containing all rows of data for the specified columns. The command pt_data[2:3] will also extract the temperature and flu_status columns. However, referring to the columns by name results in clear and easy-to-maintain R code, which will not break if the data frame is later reordered.
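
You can confirm that the positional version returns the same result:

> pt_data[2:3]
  temperature flu_status
1        98.1      FALSE
2        98.6      FALSE
3       101.4       TRUE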

To extract specific values from the data frame, methods like those for accessing values in vectors are used. However, there is an important distinction—because the data frame is two-dimensional, both the desired rows and columns must be specified. Rows are specified first, followed by a comma, followed by the columns in a format like this: [rows, columns]. As with vectors, rows and columns are counted beginning at one.

For instance, to extract the value in the first row and second column of the patient data frame, use the following command:

> pt_data[1, 2]
[1] 98.1

If you would like more than a single row or column of data, specify vectors indicating the desired rows and columns. The following statement will pull data from the first and third rows and the second and fourth columns:

> pt_data[c(1, 3), c(2, 4)]
  temperature gender
1        98.1   MALE
3       101.4   MALE

To refer to every row or every column, simply leave the row or column portion blank. For example, to extract all rows of the first column:

> pt_data[, 1]
[1] "John Doe"     "Jane Doe"     "Steve Graves"

To extract all columns for the first row:

> pt_data[1, ]
  subject_name temperature flu_status gender blood symptoms
1     John Doe        98.1      FALSE   MALE     O   SEVERE

And to extract everything:

> pt_data[ , ]
  subject_name temperature flu_status gender blood symptoms
1     John Doe        98.1      FALSE   MALE     O   SEVERE
2     Jane Doe        98.6      FALSE FEMALE    AB     MILD
3 Steve Graves       101.4       TRUE   MALE     A MODERATE

Of course, columns are better accessed by name rather than position, and negative signs can be used to exclude rows or columns of data. Therefore, the output of the command:

> pt_data[c(1, 3), c("temperature", "gender")]
  temperature gender
1        98.1   MALE
3       101.4   MALE

is equivalent to:

> pt_data[-2, c(-1, -3, -5, -6)]
  temperature gender
1        98.1   MALE
3       101.4   MALE

We often need to create new columns in data frames—perhaps, for instance, as a function of existing columns. For example, we may need to convert the Fahrenheit temperature readings in the patient data frame into the Celsius scale. To do this, we simply use the assignment operator to assign the result of the conversion calculation to a new column name as follows:

> pt_data$temp_c <- (pt_data$temperature - 32) * (5 / 9)

To confirm the calculation worked, let’s compare the new Celsius-based temp_c column to the previous Fahrenheit-scale temperature column:

> pt_data[c("temperature", "temp_c")]
  temperature   temp_c
1        98.1 36.72222
2        98.6 37.00000
3       101.4 38.55556

Seeing these side by side, we can confirm that the calculation has worked correctly.

As these types of operations are crucial for much of the work we will do in upcoming chapters, it is important to become very familiar with data frames. You might try practicing similar operations with the patient dataset, or even better, use data from one of your own projects—the functions to load your own data files into R will be described later in this chapter.

Matrices and arrays

In addition to data frames, R provides other structures that store values in tabular form. A matrix is a data structure that represents a two-dimensional table with rows and columns of data. Like vectors, R matrices can contain only one type of data, although they are most often used for mathematical operations and therefore typically store only numbers.

To create a matrix, simply supply a vector of data to the matrix() function, along with a parameter specifying the number of rows (nrow) or number of columns (ncol). For example, to create a 2x2 matrix storing the numbers one to four, we can use the nrow parameter to request the data be divided into two rows:

> m <- matrix(c(1, 2, 3, 4), nrow = 2)
> m
     [,1] [,2]
[1,]    1    3
[2,]    2    4

This is equivalent to the matrix produced using ncol = 2:

> m <- matrix(c(1, 2, 3, 4), ncol = 2)
> m
     [,1] [,2]
[1,]    1    3
[2,]    2    4

You will notice that R loaded the first column of the matrix first before loading the second column. This is called column-major order, which is R’s default method for loading matrices.

To override this default setting and load a matrix by rows, set the parameter byrow = TRUE when creating the matrix.
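
For example, recreating the 2x2 matrix with byrow = TRUE fills the first row before moving on to the second:

> m <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)
> m
     [,1] [,2]
[1,]    1    2
[2,]    3    4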

To illustrate this further, let’s see what happens if we add more values to the matrix. With six values, requesting two rows creates a matrix with three columns:

> m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Requesting two columns creates a matrix with three rows:

> m <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2)
> m
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

As with data frames, values in matrices can be extracted using [row, column] notation. For instance, m[1, 1] will return the value 1 while m[3, 2] will extract 6 from the m matrix. Additionally, entire rows or columns can be requested:

> m[1, ]
[1] 1 4
> m[, 1]
[1] 1 2 3

Closely related to the matrix structure is the array, which is a multidimensional table of data. Where a matrix has rows and columns of values, an array has rows, columns, and one or more additional layers of values. Although we will occasionally use matrices in later chapters, the use of arrays is unnecessary within the scope of this book.
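
Still, for the curious, a small example shows how the extra dimension works. The array() function takes a vector of values and a dim parameter listing the number of rows, columns, and layers; the values fill each layer in column-major order, just as with a matrix:

> array(c(1, 2, 3, 4, 5, 6, 7, 8), dim = c(2, 2, 2))
, , 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2

     [,1] [,2]
[1,]    5    7
[2,]    6    8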

 

Managing data with R

One of the challenges faced while working with massive datasets involves gathering, preparing, and otherwise managing data from a variety of sources. Although we will cover data preparation, data cleaning, and data management in depth by working on real-world machine learning tasks in later chapters, this section highlights the basic functionality for getting data in and out of R.

Saving, loading, and removing R data structures

When you’ve spent a lot of time getting a data frame into the desired form, you shouldn’t need to recreate your work each time you restart your R session.

To save data structures to a file that can be reloaded later or transferred to another system, the save() function can be used to write one or more R data structures to the location specified by the file parameter. R data files have an .RData or .rda extension.

Suppose you had three objects named x, y, and z that you would like to save to a permanent file. These might be vectors, factors, lists, data frames, or any other R object. To save them to a file named mydata.RData, use the following command:

> save(x, y, z, file = "mydata.RData")

The load() command can recreate any data structures that have been saved to an .RData file. To load the mydata.RData file created in the preceding code, simply type:

> load("mydata.RData")

This will recreate the x, y, and z data structures in your R environment.

Be careful what you are loading! All data structures stored in the file you are importing with the load() command will be added to your workspace, even if they overwrite something else you are working on.

Alternatively, the saveRDS() function can be used to save a single R object to a file. Although it is much like the save() function, a key distinction is that the corresponding readRDS() function allows the object to be loaded under a different name than the original object. For this reason, saveRDS() may be safer to use when transferring R objects across projects, because it reduces the risk of accidentally overwriting existing objects in the R environment.

The saveRDS() function is especially helpful for saving machine learning model objects. Because some machine learning algorithms take a long time to train the model, saving the model to an .rds file can help avoid a long re-training process when a project is resumed. For example, to save a model object named my_model to a file named my_model.rds, use the following syntax:

> saveRDS(my_model, file = "my_model.rds")

To load the model, use the readRDS() function and assign the result an object name as follows:

> my_model <- readRDS("my_model.rds")

After you’ve been working in an R session for some time, you may have accumulated unused data structures. In RStudio, these objects are visible in the Environment tab of the interface, but it is also possible to access these objects programmatically using the listing function ls(), which returns a vector of all data structures currently in memory.

For example, if you’ve been following along with the code in this chapter, the ls() function returns the following:

> ls()
 [1] "blood"        "fever"        "flu_status"   "gender"      
 [5] "m"            "pt_data"      "subject_name" "subject1"    
 [9] "symptoms"     "temperature"

R automatically clears all data structures from memory upon quitting the session, but for large objects, you may want to free up the memory sooner. The remove function rm() can be used for this purpose. For example, to eliminate the m and subject1 objects, simply type:

> rm(m, subject1)

The rm() function can also be supplied with a character vector of object names to remove. This works with the ls() function to clear the entire R session:

> rm(list = ls())

Be very careful when executing the preceding code, as you will not be prompted before your objects are removed!

If you need to wrap up your R session in a hurry, the save.image() command will write your entire session to a file simply called .RData. By default, when quitting R or RStudio, you will be asked if you would like to create this file. R will look for this file the next time you start R, and if it exists, your session will be recreated just as you had left it.

Importing and saving datasets from CSV files

It is common for public datasets to be stored in text files. Text files can be read on virtually any computer or operating system, which makes the format nearly universal. They can also be exported and imported from and to programs such as Microsoft Excel, providing a quick and easy way to work with spreadsheet data.

A tabular (as in “table”) data file is structured in matrix form, such that each line of text reflects one example, and each example has the same number of features. The feature values on each line are separated by a predefined symbol known as a delimiter. Often, the first line of a tabular data file lists the names of the data columns. This is called a header line.

Perhaps the most common tabular text file format is the comma-separated values (CSV) file, which, as the name suggests, uses the comma as a delimiter. CSV files can be imported to and exported from many common applications. A CSV file representing the medical dataset constructed previously could be stored as:

subject_name,temperature,flu_status,gender,blood_type
John Doe,98.1,FALSE,MALE,O
Jane Doe,98.6,FALSE,FEMALE,AB
Steve Graves,101.4,TRUE,MALE,A

Given a patient data file named pt_data.csv located in the R working directory, the read.csv() function can be used as follows to load the file into R:

> pt_data <- read.csv("pt_data.csv")

This will read the CSV file into a data frame titled pt_data. If your dataset resides outside the R working directory, the full path to the CSV file (for example, "/path/to/mydata.csv") can be used when calling the read.csv() function.

By default, R assumes that the CSV file includes a header line listing the names of the features in the dataset. If a CSV file does not have a header, specify the option header = FALSE as shown in the following command, and R will assign generic feature names by numbering the columns sequentially as V1, V2, and so on:

> pt_data <- read.csv("pt_data.csv", header = FALSE)

As an important historical note, in versions of R prior to 4.0, the read.csv() function automatically converted all character type columns into factors due to a stringsAsFactors parameter that was set to TRUE by default. This feature was occasionally helpful, especially on the smaller and simpler datasets used in the earlier years of R. However, as datasets have become larger and more complex, this feature began to cause more problems than it solved. Now, starting with version 4.0, R sets stringsAsFactors = FALSE by default. If you are certain that every character column in a CSV file is truly a factor, it is possible to convert them using the following syntax:

> pt_data <- read.csv("pt_data.csv", stringsAsFactors = TRUE)

We will set stringsAsFactors = TRUE occasionally throughout the book, when working with datasets in which all character columns are truly factors.

Getting results data out of R can be almost as important as getting it in! To save a data frame to a CSV file, use the write.csv() function. For a data frame named pt_data, simply enter:

> write.csv(pt_data, file = "pt_data.csv", row.names = FALSE)

This will write a CSV file with the name pt_data.csv to the R working folder. The row.names parameter overrides R’s default setting, which is to output row names in the CSV file. Generally, this output is unnecessary and will simply inflate the size of the resulting file.

For more sophisticated control over reading in files, note that read.csv() is a special case of the read.table() function, which can read tabular data in many different forms. This includes other delimited formats such as tab-separated values (TSV) and vertical bar (|) delimited files. For more detailed information on the read.table() family of functions, refer to the R help page using the ?read.table command.
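
As an illustration, a tab-separated version of the patient file (a hypothetical pt_data.tsv) could be read with a command such as:

> pt_data <- read.table("pt_data.tsv", sep = "\t", header = TRUE)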

Importing common dataset formats using RStudio

For more complex importation scenarios, the RStudio Desktop software offers a simple interface, which will guide you through the process of writing R code that can be used to load the data into your project. Although it has always been relatively easy to load plaintext data formats like CSV, importing other common analytical data formats like Microsoft Excel (.xls and .xlsx), SAS (.sas7bdat and .xpt), SPSS (.sav and .por), and Stata (.dta) was once a tedious and time-consuming process, requiring knowledge of specific tricks and tools across multiple R packages. Now, the functionality is available via the Import Dataset command near the upper right of the RStudio interface, as shown in Figure 2.1:

Figure 2.1: RStudio’s “Import Dataset” feature provides options to load data from a variety of common formats

Depending on the data format selected, you may be prompted to install R packages that are required for the functionality in question. Behind the scenes, these packages will translate the data format so that it can be used in R. You will then be presented with a dialog box allowing you to choose the options for the data import process and see a live preview of how the data will appear in R as these changes are made.

The following screenshot illustrates the process of importing a Microsoft Excel version of the used cars dataset using the readxl package (https://readxl.tidyverse.org), but the process is similar for any of the dataset formats:

Figure 2.2: The data import dialog provides a “Code Preview” that can be copy-and-pasted into your R code file

The Code Preview in the bottom-right of this dialog provides the R code to perform the importation with the specified options. Selecting the Import button will immediately execute the code; however, a better practice is to copy and paste the code into your R source code file, so that you can re-import the dataset in future sessions.

The read_excel() function RStudio uses to load Excel data creates an R object called a “tibble” rather than a data frame. The differences are so subtle that you may not even notice! However, tibbles are an important R innovation enabling new ways to work with data frames. The tibble and its functionality are discussed in Chapter 12, Advanced Data Preparation.

The RStudio interface has made it easier than ever to work with data in a variety of formats, but more advanced functionality exists for working with large datasets. In particular, if you have data residing in database platforms like Microsoft SQL, MySQL, PostgreSQL, and others, it is possible to connect R to such databases to pull the data into R, or even utilize the database hardware itself to perform big data computations prior to bringing the results into R. Chapter 15, Making Use of Big Data, introduces these techniques and provides instructions for connecting to common databases using RStudio.

   
About the Author
  • Brett Lantz

    Brett Lantz (DataSpelunking) has spent more than 10 years using innovative data methods to understand human behavior. A sociologist by training, Brett was first captivated by machine learning during research on a large database of teenagers' social network profiles. Brett is a DataCamp instructor and a frequent speaker at machine learning conferences and workshops around the world. He is known to geek out about data science applications for sports, autonomous vehicles, foreign language learning, and fashion, among many other subjects, and hopes to one day blog about these subjects at Data Spelunking, a website dedicated to sharing knowledge about the search for insight in data.
