You're reading from R for Data Science Cookbook
In the previous chapter, we covered how to integrate data from various data sources. However, simply collecting data is not enough; you also have to ensure the quality of the collected data. If the quality of data used is insufficient, the results of the analysis may be misleading due to biased samples or missing values. Moreover, if the collected data is not well structured and shaped, you may find it hard to correlate and investigate the data. Therefore, data preprocessing and preparation is an essential task that you must perform prior to data analysis.
Those of you familiar with how SQL operates may already understand how to use databases to process data. For example, SQL allows users to add new records with the insert operation, modify data with the update operation, and remove records with the delete operation. However, we do not need to move collected data back to the database; R already provides more powerful and convenient preprocessing functions and packages. In this...
The use of a data frame enables the user to select and filter data by row names and column names. As not all imported datasets contain row names and column names, we need to rename the columns of this dataset with a built-in naming function.
In this recipe, you need to prepare your environment with R installed and a computer that can access the Internet.
Perform the following steps to rename data:
First, download employees.csv from the GitHub link https://github.com/ywchiu/rcookbook/raw/master/chapter3/employees.csv:

> download.file("https://github.com/ywchiu/rcookbook/raw/master/chapter3/employees.csv", "employees.csv")
Additionally, download salaries.csv from the GitHub link https://github.com/ywchiu/rcookbook/raw/master/chapter3/salaries.csv:

> download.file("https://github.com/ywchiu/rcookbook/raw/master/chapter3/salaries.csv", "salaries.csv")
Next, read the file into an R session with the read.csv function:

> employees <- read.csv('employees...
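Once the data is read in, renaming can be sketched as follows. This is a minimal illustration using a small hypothetical data frame in place of the downloaded employees.csv, since the file contents are not shown here; the names (and colnames) functions are base R.

```r
# Minimal sketch of renaming columns, using an illustrative
# data frame in place of the downloaded employees.csv.
employees <- data.frame(V1 = c(10001, 10002),
                        V2 = c("Georgi", "Bezalel"))

# names() (or colnames()) both reads and sets the column names
names(employees) <- c("emp_no", "first_name")

print(names(employees))  # "emp_no" "first_name"
```

The same pattern works for row names via rownames().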
If we do not specify a data type during the import phase, R will automatically assign a type to the imported dataset. However, if the data type assigned is different to the actual type, we may face difficulties in further data manipulation. Thus, data type conversion is an essential step during the preprocessing phase.
Complete the previous recipe and import both employees.csv and salaries.csv into an R session. You must also specify column names for these two datasets to be able to perform the following steps.
Perform the following steps to convert the data type:
First, examine the data type of each attribute using the class function:

> class(employees$birth_date)
[1] "factor"
You can also examine the types of all attributes using the str function:

> str(employees)
'data.frame': 10 obs. of 6 variables:
 $ emp_no    : int 10001 10002 10003 10004 10005 10006 10007 10008 10009 10010
 $ birth_date: Factor w/ 10 levels "1952-04-19","1953-04-20...
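A typical conversion is turning a factor column into a proper Date. The sketch below recreates the factor situation by hand (older R versions made strings into factors by default under read.csv; since R 4.0 you would pass stringsAsFactors = TRUE to get the same behavior), with illustrative values rather than the real employees.csv contents.

```r
# Sketch of converting a factor attribute to the Date type
# (birth_date values here are illustrative).
employees <- data.frame(birth_date = c("1953-09-02", "1964-06-02"))
employees$birth_date <- factor(employees$birth_date)  # mimic the factor import

# Convert factor -> character -> Date
employees$birth_date <- as.Date(as.character(employees$birth_date))

print(class(employees$birth_date))  # "Date"
```

Going through as.character first matters: as.Date on a factor's underlying integer codes would produce the wrong result.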
After we have converted each data attribute to the proper data type, we may determine that some attributes in employees and salaries are in the date format. Thus, we can calculate the number of years between each employee's date of birth and the current year to estimate their age. Here, we will show you how to use some built-in date functions and the lubridate package to manipulate date format data.
Refer to the previous recipe and convert each attribute of the imported data into the correct data type. Also, you have to rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.
Perform the following steps to work with the date format in employees and salaries:
We can add or subtract days from a date format attribute using the following:
> employees$hire_date + 30
We can obtain time differences in days between hire_date and birth_date using the following:

> employees...
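The base-R side of this can be sketched with the built-in difftime function; the dates below are illustrative and the 365.25 divisor is a common approximation for converting days to whole years.

```r
# Sketch of base-R date arithmetic on illustrative dates:
# day differences between hire_date and birth_date, then a rough age.
birth_date <- as.Date(c("1953-09-02", "1964-06-02"))
hire_date  <- as.Date(c("1986-06-26", "1985-11-21"))

# Days between birth and hiring
diff_days <- as.numeric(difftime(hire_date, birth_date, units = "days"))

# Approximate age at hire in whole years (365.25 accounts for leap years)
age_at_hire <- floor(diff_days / 365.25)
print(age_at_hire)  # 32 21
```

lubridate offers the same arithmetic more readably (for example, interval and years), but difftime requires no extra package.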
For those of you familiar with databases, you may already know how to perform an insert operation to append a new record to the dataset. Alternatively, you can use an alter operation to add a new column (attribute) to a table. In R, you can also perform insert and alter operations, but much more easily. We will introduce the rbind and cbind functions in this recipe so that you can easily append a new record or new attribute to the current dataset with R.
Refer to the Converting data types recipe and convert each attribute of the imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.
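The rbind/cbind pattern can be sketched as follows, using a small illustrative data frame rather than the real employees dataset.

```r
# Sketch of rbind (the "insert" analogue) and cbind (the
# "alter ... add column" analogue) on illustrative data.
emp <- data.frame(emp_no = c(10001, 10002),
                  first_name = c("Georgi", "Bezalel"))

# rbind appends a record; the new row's column names must match
emp <- rbind(emp, data.frame(emp_no = 10003, first_name = "Parto"))

# cbind appends an attribute; its length must match the row count
emp <- cbind(emp, gender = c("M", "F", "M"))

print(dim(emp))  # 3 3
```

Note that rbind recycles nothing silently across data frames: mismatched column names raise an error, which helps catch malformed records early.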
Data filtering is the most common requirement for users who want to analyze partial data of interest rather than the whole dataset. In database operations, we can use a SQL command with a where clause to subset the data. In R, we can simply use square brackets to perform filtering.
Refer to the Converting data types recipe and convert each attribute of the imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.
Perform the following steps to filter data:
First, use head and tail to subset the first three rows and last three rows from the employees dataset:

> head(employees, 3)
  emp_no birth_date first_name last_name gender  hire_date
1  10001 1953-09-02     Georgi   Facello      M 1986-06-26
2  10002 1964-06-02    Bezalel    Simmel      F 1985-11-21
3  10003 1959-12-03      Parto   Bamford      M 1986-08-28
> tail(employees, 3)
...
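Condition-based filtering with square brackets can be sketched as follows; the data frame is illustrative, not the full employees table.

```r
# Sketch of square-bracket filtering (the "where" clause analogue).
employees <- data.frame(emp_no = c(10001, 10002, 10003),
                        gender = c("M", "F", "M"),
                        stringsAsFactors = FALSE)

# Rows where gender == "M"; the trailing comma keeps all columns
males <- employees[employees$gender == "M", ]

# Equivalent, often more readable, with subset()
males2 <- subset(employees, gender == "M")

print(nrow(males))  # 2
```

The logical vector before the comma selects rows; an index vector after the comma would select columns, mirroring SQL's SELECT list.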
In the previous recipes, we introduced how to revise and filter datasets. Following these steps almost concludes the data preprocessing and preparation phase. However, we may still find some bad data within our dataset. Thus, we should discard bad data or unwanted records to prevent them from generating misleading results. Here, we introduce some practical methods to remove this unnecessary data.
Refer to the Converting data types recipe and convert each attribute of the imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.
Perform the following steps to drop an attribute from the current dataset:
First, you can drop the last_name column (the fourth column) by excluding it from our filtered subset:

> employees <- employees[,-4]
Or, you can assign NULL to the attribute you wish to drop:

> employees$hire_date <- NULL
To drop rows, you can specify...
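Both column and row removal can be sketched together on a small illustrative data frame:

```r
# Sketch of dropping an attribute and a row (illustrative data).
employees <- data.frame(emp_no = c(10001, 10002, 10003),
                        last_name = c("Facello", "Simmel", "Bamford"),
                        gender = c("M", "F", "M"))

employees$gender <- NULL         # drop a column by assigning NULL
employees <- employees[-2, ]     # drop row 2 with a negative row index

print(dim(employees))  # 2 2
```

Negative indices mean "everything except", so employees[-c(1, 3), ] would drop rows 1 and 3 in one step.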
Merging data enables us to understand how different data sources relate to each other. The merge operation in R is similar to the join operation in a database, which combines fields from two datasets using values that are common to each.
Refer to the Converting data types recipe and convert each attribute of the imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.
Perform the following steps to merge salaries and employees:
As employees and salaries share the common key emp_no, we can merge these two datasets using emp_no as the join key:

> employees_salary <- merge(employees, salaries, by="emp_no")
> head(employees_salary, 3)
  emp_no birth_date first_name last_name salary  from_date    to_date
1  10001 1953-09-02     Georgi   Facello  60117 1986-06-26 1987-06-26
2  10001 1953-09-02     Georgi   Facello  62102 1987-06-26 1988-06-25
3  10001...
The power of sorting enables us to view data in an arrangement that lets us analyze it more efficiently. In a database, we can use an order by clause to sort data by specified columns. In R, we can use the order and sort functions to arrange data.
Refer to the Converting data types recipe and convert each attribute of the imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.
Perform the following steps to sort the salaries dataset:
First, we can use the sort function to sort data:

> a <- c(5,1,4,3,2,6,3)
> sort(a)
[1] 1 2 3 3 4 5 6
> sort(a, decreasing=TRUE)
[1] 6 5 4 3 3 2 1
Next, we can determine how the order function works on the same input vector:

> order(a)
[1] 2 5 4 7 3 1 6
> order(a, decreasing = TRUE)
[1] 6 1 3 4 7 5 2
To sort a data frame by a specific column, we first obtain the ordered...
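Because sort returns the rearranged values while order returns the row indices that produce that arrangement, order is the one that applies to data frames. A minimal sketch, with illustrative salary data:

```r
# Sketch of sorting a data frame by a column via order().
salaries <- data.frame(emp_no = c(10002, 10001, 10003),
                       salary = c(65828, 60117, 40006))

# order() returns row indices; using them inside [ , ] reorders the rows
sorted <- salaries[order(salaries$salary, decreasing = TRUE), ]

print(sorted$emp_no)  # 10002 10001 10003
```

Passing multiple columns to order, such as order(salaries$emp_no, salaries$salary), mirrors a multi-column ORDER BY clause.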
Reshaping data is similar to creating a contingency table, which enables the user to aggregate data by specific values. The reshape2 package is designed for this specific purpose. Here, we introduce how to use the reshape2 package to transform our dataset from long to wide format with the dcast function. We also cover how to transform it from wide format back to long format with the melt function.
Refer to the Merging data recipe and merge employees and salaries into employees_salary.
Perform the following steps to reshape data:
First, we can use the dcast function to transform data from long to wide format:

> wide_salaries <- dcast(salaries, emp_no ~ year(ymd(from_date)), value.var="salary")
> wide_salaries[1:3, 1:7]
  emp_no  1985  1986  1987  1988  1989  1990
1  10001    NA 60117 62102 66074 66596 66961
2  10002    NA    NA    NA    NA    NA    NA
3  10003    NA    NA    NA    NA    NA    NA
We can also transform the data by keeping emp_no and the formatted...
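The round trip between the two shapes can be sketched on a tiny hypothetical table; this assumes the reshape2 package is installed, and the year columns and values are illustrative rather than taken from salaries.csv.

```r
# Sketch of long <-> wide reshaping with reshape2 (illustrative data;
# assumes reshape2 is installed).
library(reshape2)

wide <- data.frame(emp_no = c(10001, 10002),
                   `1986` = c(60117, NA),
                   `1987` = c(62102, 65828),
                   check.names = FALSE)

# melt: wide -> long; each emp_no/year pair becomes one row
long <- melt(wide, id.vars = "emp_no",
             variable.name = "year", value.name = "salary")

# dcast: long -> wide again, one column per year
wide2 <- dcast(long, emp_no ~ year, value.var = "salary")

print(dim(long))  # 4 3
```

The formula in dcast reads as rows ~ columns: variables on the left index the rows, and each distinct value of the right-hand variable becomes a column.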
There are numerous causes behind missing data. For example, it could be the result of typos or data-processing flaws. However, if there is missing data in our analysis process, the results of the analysis may be misleading. Thus, it is important to detect missing values before proceeding with further analysis.
Refer to the Converting data types recipe and convert each attribute of the imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.
Perform the following steps to detect missing values:
First, we select the records whose to_date attribute has a date later than 2100-01-01:

> salaries[salaries$to_date > "2100-01-01",]
We then change the data with a date later than 2100-01-01 to a missing value:

> salaries[salaries$to_date > "2100-01-01","to_date"] = NA
Next, we can use the is.na function to find which rows contain missing values:

> is.na(salaries...
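The detection step can be sketched end to end on an illustrative data frame; complete.cases, the row-wise complement of is.na, is often the more convenient tool for whole records.

```r
# Sketch of locating missing values with is.na() and complete.cases()
# (illustrative data).
salaries <- data.frame(emp_no = c(10001, 10001, 10002),
                       to_date = c("1987-06-26", NA, "1996-08-03"),
                       stringsAsFactors = FALSE)

na_mask <- is.na(salaries$to_date)                # per-value logical mask
missing_rows <- which(!complete.cases(salaries))  # rows with any NA

print(missing_rows)  # 2
```

sum(na_mask) or colSums(is.na(salaries)) then gives a quick count of missing values per column.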
The previous recipe showed us how to detect missing values within the dataset. Though data with missing values is incomplete, we can still adopt a heuristic approach to complete our dataset. Here, we introduce some techniques one can employ to impute missing values.
Refer to the Converting data types recipe and convert each attribute of the imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps in the Renaming the data variable recipe.
Perform the following steps to impute missing values:
First, we subset the user data with emp_no equal to 10001:

> test.emp <- salaries[salaries$emp_no == 10001,]
Then, we purposely set the salary of row 8 to a missing value:

> test.emp[8,c("salary")]
[1] 75286
> test.emp[8,c("salary")] = NA
For the first imputation method, we can remove records with missing values using the na.omit function:

> na.omit(test.emp)
On the other...
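A common alternative to dropping records is filling the gap with a summary statistic; the sketch below shows a simple mean imputation on an illustrative salary vector.

```r
# Sketch of mean imputation as an alternative to na.omit
# (illustrative values).
salary <- c(60117, 62102, NA, 66074)

# Replace each NA with the mean of the observed values
salary[is.na(salary)] <- mean(salary, na.rm = TRUE)

print(salary[3])  # 62764.33
```

Mean imputation preserves the sample mean but shrinks the variance, so it is a heuristic best reserved for a small share of missing values.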