Chapter 3. Data Preprocessing and Preparation

This chapter covers the following topics:

  • Renaming the data variable

  • Converting data types

  • Working with the date format

  • Adding new records

  • Filtering data

  • Dropping data

  • Merging data

  • Sorting data

  • Reshaping data

  • Detecting missing data

  • Imputing missing data

Introduction


In the previous chapter, we covered how to integrate data from various data sources. However, simply collecting data is not enough; you also have to ensure its quality. If the quality of the collected data is poor, the results of the analysis may be misleading due to biased samples or missing values. Moreover, if the collected data is not well structured and shaped, you may find it hard to correlate and investigate it. Therefore, data preprocessing and preparation is an essential task that you must perform prior to data analysis.

Those of you familiar with SQL may already understand how to use a database to process data. For example, SQL allows users to add new records with the insert operation, modify data with the update operation, and remove records with the delete operation. However, we do not need to move the collected data back into a database; R already provides more powerful and convenient preprocessing functions and packages. In this...

Renaming the data variable


A data frame enables the user to select and filter data by row names and column names. As not all imported datasets contain meaningful row and column names, we need to rename the columns of such datasets with a built-in naming function.

Getting ready

In this recipe, you need a computer with R installed and access to the Internet.

How to do it…

Perform the following steps to rename data:

  1. First, download employees.csv from the GitHub link https://github.com/ywchiu/rcookbook/raw/master/chapter3/employees.csv:

    > download.file("https://github.com/ywchiu/rcookbook/raw/master/chapter3/employees.csv", "employees.csv")
    
  2. Additionally, download salaries.csv from the GitHub link https://github.com/ywchiu/rcookbook/raw/master/chapter3/salaries.csv:

    > download.file("https://github.com/ywchiu/rcookbook/raw/master/chapter3/salaries.csv", "salaries.csv")
    
  3. Next, read the file into an R session with the read.csv function:

    > employees <- read.csv('employees...
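
As a minimal sketch of how the renaming itself can be done (whether the CSV files contain a header row is an assumption here), assign a character vector of names with colnames, using the column names that appear later in this chapter:

    > employees <- read.csv('employees.csv', header = FALSE)
    > salaries  <- read.csv('salaries.csv', header = FALSE)
    > # assign the column names used throughout this chapter
    > colnames(employees) <- c('emp_no', 'birth_date', 'first_name',
    +                          'last_name', 'gender', 'hire_date')
    > colnames(salaries)  <- c('emp_no', 'salary', 'from_date', 'to_date')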

Converting data types


If we do not specify a data type during the import phase, R automatically assigns a type to the imported dataset. However, if the assigned data type differs from the actual type, we may face difficulties in further data manipulation. Thus, data type conversion is an essential step during the preprocessing phase.

Getting ready

Complete the previous recipe and import both employees.csv and salaries.csv into an R session. You must also specify column names for these two datasets to be able to perform the following steps.

How to do it…

Perform the following steps to convert the data type:

  1. First, examine the data type of each attribute using the class function:

    > class(employees$birth_date)
    [1] "factor"
    
  2. You can also examine types of all attributes using the str function:

    > str(employees)
    
    'data.frame': 10 obs. of  6 variables:
     $ emp_no    : int  10001 10002 10003 10004 10005 10006 10007 10008 10009 10010
     $ birth_date: Factor w/ 10 levels "1952-04-19","1953-04-20...
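
As a minimal sketch of the conversion itself (the target types here are an assumption; the recipe's remaining steps may convert other columns as well), as.Date and as.character can be applied column by column:

    > # convert the factor columns into more suitable types
    > employees$birth_date <- as.Date(employees$birth_date)
    > employees$hire_date  <- as.Date(employees$hire_date)
    > employees$first_name <- as.character(employees$first_name)
    > class(employees$birth_date)
    [1] "Date"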

Working with the date format


After we have converted each data attribute to the proper data type, we may find that some attributes in employees and salaries are in the date format. We can therefore calculate the number of years between each employee's date of birth and the current year to estimate their age. Here, we will show you how to use some built-in date functions and the lubridate package to manipulate date-format data.

Getting ready

Refer to the previous recipe and convert each attribute of imported data into the correct data type. Also, you have to rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.

How to do it…

Perform the following steps to work with the date format in employees and salaries:

  1. We can add days to or subtract days from a date-format attribute as follows:

    > employees$hire_date + 30
    
  2. We can obtain time differences in days between hire_date and birth_date using the following:

    > employees...
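
As a hedged sketch of the age estimate described above (the exact expression may differ from the recipe's remaining steps), lubridate's year function can be used to take the difference between calendar years:

    > library(lubridate)
    > # assuming birth_date is already of class Date, estimate each
    > # employee's age as the difference between calendar years
    > year(Sys.Date()) - year(employees$birth_date)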

Adding new records


For those of you familiar with databases, you may already know how to perform an insert operation to append a new record to a table, or an alter operation to add a new column (attribute) to it. In R, you can perform both operations even more easily. In this recipe, we introduce the rbind and cbind functions so that you can easily append a new record or a new attribute to the current dataset.

Getting ready

Refer to the Converting data types recipe and convert each attribute of imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.

How to do it…

Perform the following steps to add a new record or new variable into the dataset:

  1. First, use rbind to insert a new record to employees:

    > employees <- rbind(employees, c(10011, '1960-01-01', 'John', 'Doe', 'M', '1988-01-01'))
    
  2. We can then reassign the...
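
The rbind call above appends a new record; as a purely illustrative sketch of cbind (the active column is hypothetical and not part of the original data), a new attribute can be appended in much the same way:

    > # hypothetical new attribute marking every record as active
    > active <- rep(TRUE, nrow(employees))
    > employees <- cbind(employees, active)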

Filtering data


Data filtering is the most common requirement for users who want to analyze only the data of interest rather than the whole dataset. In database operations, we can use a SQL statement with a where clause to subset the data. In R, we can simply use square brackets to perform the same filtering.

Getting ready

Refer to the Converting data types recipe and convert each attribute of imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.

How to do it…

Perform the following steps to filter data:

  1. First, use head and tail to subset the first three rows and last three rows from the employees dataset:

    > head(employees, 3)
      emp_no birth_date first_name last_name gender  hire_date
    1  10001 1953-09-02     Georgi   Facello      M 1986-06-26
    2  10002 1964-06-02    Bezalel    Simmel      F 1985-11-21
    3  10003 1959-12-03      Parto   Bamford      M 1986-08-28
    
    > tail(employees, 3)
    ...
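
Beyond head and tail, square brackets also accept logical conditions; as an illustrative sketch (the condition itself is an assumption, and hire_date is assumed to already be of class Date):

    > # rows for male employees hired on or after 1986-01-01
    > employees[employees$gender == 'M' &
    +           employees$hire_date >= as.Date('1986-01-01'), ]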

Dropping data


In the previous recipes, we introduced how to revise and filter datasets. Following those steps almost concludes the data preprocessing and preparation phase. However, we may still find some bad data within our dataset. We should discard such bad data or unwanted records to prevent them from generating misleading results. Here, we introduce some practical methods to remove unnecessary data.

Getting ready

Refer to the Converting data types recipe and convert each attribute of imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.

How to do it…

Perform the following steps to drop an attribute from the current dataset:

  1. First, you can drop the gender column (the fifth column) by excluding it from the subset:

    > employees <- employees[,-5]
    
  2. Or, you can assign NULL to the attribute you wish to drop:

    > employees$hire_date <- NULL
    
  3. To drop rows, you can specify...
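
As a hedged sketch of dropping rows (the indices and condition here are illustrative rather than the recipe's exact example), a negative row index or a logical condition inside the square brackets removes unwanted records:

    > # drop the first two rows by negative index
    > employees <- employees[-c(1, 2), ]
    > # or keep only the rows whose emp_no is not missing
    > employees <- employees[!is.na(employees$emp_no), ]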

Merging data


Merging data enables us to understand how different data sources relate to each other. The merge operation in R is similar to the join operation in a database: it combines fields from two datasets using values that are common to both.

Getting ready

Refer to the Converting data types recipe and convert each attribute of imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.

How to do it…

Perform the following steps to merge salaries and employees:

  1. As employees and salaries share the common field emp_no, we can merge these two datasets using emp_no as the join key:

    > employees_salary  <- merge(employees, salaries, by="emp_no")
    > head(employees_salary,3)
      emp_no birth_date first_name last_name salary  from_date    to_date
    1  10001 1953-09-02     Georgi   Facello  60117 1986-06-26 1987-06-26
    2  10001 1953-09-02     Georgi   Facello  62102 1987-06-26 1988-06-25
    3  10001...
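
By default, merge performs an inner join and keeps only matching emp_no values; as a hedged addition beyond the step shown above, the all, all.x, and all.y arguments produce full, left, and right outer joins:

    > # full outer join: also keep employees without salary records and vice versa
    > employees_salary_all <- merge(employees, salaries, by = "emp_no", all = TRUE)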

Sorting data


Sorting enables us to view data in a meaningful arrangement so that we can analyze it more efficiently. In a database, we can use an order by clause to sort data by specified columns. In R, we can use the order and sort functions to arrange data.

Getting ready

Refer to the Converting data types recipe and convert each attribute of imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.

How to do it…

Perform the following steps to sort the salaries dataset:

  1. First, we can use the sort function to sort data:

    > a <- c(5,1,4,3,2,6,3)
    > sort(a)
    [1] 1 2 3 3 4 5 6
    > sort(a, decreasing=TRUE)
    [1] 6 5 4 3 3 2 1
    
  2. Next, we can determine how the order function works on the same input vector:

    > order(a)
    [1] 2 5 4 7 3 1 6
    > order(a, decreasing = TRUE)
    [1] 6 1 3 4 7 5 2
    
  3. To sort a data frame by a specific column, we first obtain the ordered...
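
As a minimal sketch of that step (the column choice is illustrative), the vector returned by order serves as a row index to rearrange the data frame:

    > # use the order vector as a row index to sort salaries by salary, highest first
    > sorted_salaries <- salaries[order(salaries$salary, decreasing = TRUE), ]
    > head(sorted_salaries, 3)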

Reshaping data


Reshaping data is similar to creating a contingency table, which enables the user to aggregate data by specific values. The reshape2 package is designed for this purpose. Here, we introduce how to use the reshape2 package to transform our dataset from long to wide format with the dcast function. We also cover how to transform it from wide format back to long format with the melt function.

Getting ready

Refer to the Merging data recipe and merge employees and salaries into employees_salary.

How to do it…

Perform the following steps to reshape data:

  1. First, we can use the dcast function to transform data from long to wide:

    > wide_salaries <- dcast(salaries, emp_no ~ year(ymd(from_date)), value.var="salary")
    > wide_salaries[1:3, 1:7]
      emp_no 1985  1986  1987  1988  1989  1990
    1  10001   NA 60117 62102 66074 66596 66961
    2  10002   NA    NA    NA    NA    NA    NA
    3  10003   NA    NA    NA    NA    NA    NA
    
  2. We can also transform the data by keeping emp_no and the formatted...
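
As a hedged sketch of the reverse transformation (the variable.name and value.name labels are illustrative), melt turns the wide table back into long format:

    > library(reshape2)
    > # turn the wide table back into one row per employee-year combination
    > long_salaries <- melt(wide_salaries, id.vars = "emp_no",
    +                       variable.name = "year", value.name = "salary")
    > head(na.omit(long_salaries), 3)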

Detecting missing data


There are numerous causes of missing data; for example, it may result from typos or flaws in the data processing pipeline. However, if missing data remains in our analysis, the results may be misleading. Thus, it is important to detect missing values before proceeding with further analysis.

Getting ready

Refer to the Converting data types recipe and convert each attribute of imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.

How to do it…

Perform the following steps to detect missing values:

  1. First, we select the records whose to_date is later than 2100-01-01:

    > salaries[salaries$to_date > "2100-01-01",]
    
  2. We then replace to_date values later than 2100-01-01 with a missing value:

    > salaries[salaries$to_date > "2100-01-01","to_date"] = NA
    
  3. Next, we can use the is.na function to find which rows contain missing values:

    > is.na(salaries...
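
As a brief, hedged illustration of summarizing the result (beyond the element-wise is.na test), sum and complete.cases report how many and which rows contain missing values:

    > # count the missing to_date values
    > sum(is.na(salaries$to_date))
    > # show the rows that contain any missing value
    > salaries[!complete.cases(salaries), ]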

Imputing missing data


The previous recipe showed us how to detect missing values within the dataset. Though data with missing values is incomplete, we can still adopt a heuristic approach to complete our dataset. Here, we introduce some techniques you can employ to impute missing values.

Getting ready

Refer to the Converting data types recipe and convert each attribute of imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps in the Renaming the data variable recipe.

How to do it…

Perform the following steps to impute missing values:

  1. First, we subset the salary records of the employee with emp_no equal to 10001:

    > test.emp <- salaries[salaries$emp_no == 10001,]
    
  2. Then, we purposely set the salary of row 8 to a missing value:

    > test.emp[8,c("salary")]
    [1] 75286
    > test.emp[8,c("salary")] = NA
    
  3. For the first imputation method, we can remove records with missing values using the na.omit function:

    > na.omit(test.emp)
    
  4. On the other...
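
As a hedged sketch of one common alternative (mean imputation; the recipe's remaining steps may use a different method), the missing salary can be replaced with the mean of the employee's observed salaries:

    > # replace the missing salary with the mean of the remaining salaries
    > test.emp[is.na(test.emp$salary), "salary"] <- mean(test.emp$salary, na.rm = TRUE)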
