You're reading from R for Data Science Cookbook
In the previous chapter, we covered how to integrate data from various data sources. However, simply collecting data is not enough; you also have to ensure the quality of the collected data. If the quality of data used is insufficient, the results of the analysis may be misleading due to biased samples or missing values. Moreover, if the collected data is not well structured and shaped, you may find it hard to correlate and investigate the data. Therefore, data preprocessing and preparation is an essential task that you must perform prior to data analysis.
Those of you familiar with how SQL operates may already understand how to use databases to process data. For example, SQL allows users to add new records with the insert operation, modify data with the update operation, and remove records with the delete operation. However, we do not need to move collected data back to the database; R already provides more powerful and convenient preprocessing functions and packages. In this...
The use of a data frame enables the user to select and filter data by row names and column names. As not all imported datasets contain row names and column names, we need to rename the columns of this dataset with a built-in naming function.
In this recipe, you need to prepare your environment with R installed and a computer that can access the Internet.
Perform the following steps to rename data:
First, download employees.csv from the GitHub link https://github.com/ywchiu/rcookbook/raw/master/chapter3/employees.csv:

> download.file("https://github.com/ywchiu/rcookbook/raw/master/chapter3/employees.csv", "employees.csv")
Additionally, download salaries.csv from the GitHub link https://github.com/ywchiu/rcookbook/raw/master/chapter3/salaries.csv:

> download.file("https://github.com/ywchiu/rcookbook/raw/master/chapter3/salaries.csv", "salaries.csv")
Next, read the file into an R session with the read.csv function:

> employees <- read.csv('employees...
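Once the data is read in, renaming can be sketched as follows. This is a minimal illustration using a small hypothetical data frame in place of the downloaded employees.csv, since the file contents are not shown here; the names (and colnames) functions are base R.

```r
# Minimal sketch of renaming columns, using an illustrative
# data frame in place of the downloaded employees.csv.
employees <- data.frame(V1 = c(10001, 10002),
                        V2 = c("Georgi", "Bezalel"))

# names() (or colnames()) both reads and sets the column names
names(employees) <- c("emp_no", "first_name")

print(names(employees))  # "emp_no" "first_name"
```

The same pattern works for row names via rownames().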
If we do not specify a data type during the import phase, R will automatically assign a type to the imported dataset. However, if the data type assigned is different to the actual type, we may face difficulties in further data manipulation. Thus, data type conversion is an essential step during the preprocessing phase.
Complete the previous recipe and import both employees.csv and salaries.csv into an R session. You must also specify column names for these two datasets to be able to perform the following steps.
Perform the following steps to convert the data type:
First, examine the data type of each attribute using the class function:

> class(employees$birth_date)
[1] "factor"
You can also examine the types of all attributes using the str function:

> str(employees)
'data.frame': 10 obs. of 6 variables:
 $ emp_no    : int 10001 10002 10003 10004 10005 10006 10007 10008 10009 10010
 $ birth_date: Factor w/ 10 levels "1952-04-19","1953-04-20...
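A typical conversion is turning a factor column into a proper Date. The sketch below recreates the factor situation by hand (older R versions made strings into factors by default under read.csv; since R 4.0 you would pass stringsAsFactors = TRUE to get the same behavior), with illustrative values rather than the real employees.csv contents.

```r
# Sketch of converting a factor attribute to the Date type
# (birth_date values here are illustrative).
employees <- data.frame(birth_date = c("1953-09-02", "1964-06-02"))
employees$birth_date <- factor(employees$birth_date)  # mimic the factor import

# Convert factor -> character -> Date
employees$birth_date <- as.Date(as.character(employees$birth_date))

print(class(employees$birth_date))  # "Date"
```

Going through as.character first matters: as.Date on a factor's underlying integer codes would produce the wrong result.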
After we have converted each data attribute to the proper data type, we may determine that some attributes in employees and salaries are in the date format. Thus, we can calculate the number of years between each employee's date of birth and the current year to estimate their age. Here, we will show you how to use some built-in date functions and the lubridate package to manipulate date format data.
Refer to the previous recipe and convert each attribute of the imported data into the correct data type. Also, you have to rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.
Perform the following steps to work with the date format in employees and salaries:
We can add or subtract days from a date format attribute using the following:
> employees$hire_date + 30
We can obtain time differences in days between hire_date and birth_date using the following:

> employees...
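The base-R side of this can be sketched with the built-in difftime function; the dates below are illustrative and the 365.25 divisor is a common approximation for converting days to whole years.

```r
# Sketch of base-R date arithmetic on illustrative dates:
# day differences between hire_date and birth_date, then a rough age.
birth_date <- as.Date(c("1953-09-02", "1964-06-02"))
hire_date  <- as.Date(c("1986-06-26", "1985-11-21"))

# Days between birth and hiring
diff_days <- as.numeric(difftime(hire_date, birth_date, units = "days"))

# Approximate age at hire in whole years (365.25 accounts for leap years)
age_at_hire <- floor(diff_days / 365.25)
print(age_at_hire)  # 32 21
```

lubridate offers the same arithmetic more readably (for example, interval and years), but difftime requires no extra package.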
For those of you familiar with databases, you may already know how to perform an insert operation to append a new record to the dataset. Alternatively, you can use an alter operation to add a new column (attribute) to a table. In R, you can also perform insert and alter operations, but much more easily. We will introduce the rbind and cbind functions in this recipe so that you can easily append a new record or new attribute to the current dataset with R.
Refer to the Converting data types recipe and convert each attribute of the imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.
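The rbind/cbind pattern can be sketched as follows, using a small illustrative data frame rather than the real employees dataset.

```r
# Sketch of rbind (the "insert" analogue) and cbind (the
# "alter ... add column" analogue) on illustrative data.
emp <- data.frame(emp_no = c(10001, 10002),
                  first_name = c("Georgi", "Bezalel"))

# rbind appends a record; the new row's column names must match
emp <- rbind(emp, data.frame(emp_no = 10003, first_name = "Parto"))

# cbind appends an attribute; its length must match the row count
emp <- cbind(emp, gender = c("M", "F", "M"))

print(dim(emp))  # 3 3
```

Note that rbind recycles nothing silently across data frames: mismatched column names raise an error, which helps catch malformed records early.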
Data filtering is the most common requirement for users who want to analyze partial data of interest rather than the whole dataset. In database operations, we can use a SQL command with a where clause to subset the data. In R, we can simply use square brackets to perform filtering.
Refer to the Converting data types recipe and convert each attribute of the imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.
Perform the following steps to filter data:
First, use head and tail to subset the first three rows and last three rows from the employees dataset:

> head(employees, 3)
  emp_no birth_date first_name last_name gender  hire_date
1  10001 1953-09-02     Georgi   Facello      M 1986-06-26
2  10002 1964-06-02    Bezalel    Simmel      F 1985-11-21
3  10003 1959-12-03      Parto   Bamford      M 1986-08-28
> tail(employees, 3)
...
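Condition-based filtering with square brackets can be sketched as follows; the data frame is illustrative, not the full employees table.

```r
# Sketch of square-bracket filtering (the "where" clause analogue).
employees <- data.frame(emp_no = c(10001, 10002, 10003),
                        gender = c("M", "F", "M"),
                        stringsAsFactors = FALSE)

# Rows where gender == "M"; the trailing comma keeps all columns
males <- employees[employees$gender == "M", ]

# Equivalent, often more readable, with subset()
males2 <- subset(employees, gender == "M")

print(nrow(males))  # 2
```

The logical vector before the comma selects rows; an index vector after the comma would select columns, mirroring SQL's SELECT list.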
In the previous recipes, we introduced how to revise and filter datasets. Following these steps almost concludes the data preprocessing and preparation phase. However, we may still find some bad data within our dataset. Thus, we should discard bad data or unwanted records to prevent them from generating misleading results. Here, we introduce some practical methods to remove this unnecessary data.
Refer to the Converting data types recipe and convert each attribute of the imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.
Perform the following steps to drop an attribute from the current dataset:
First, you can drop the last_name column (the fourth column) by excluding it from our filtered subset:

> employees <- employees[,-4]
Or, you can assign NULL to the attribute you wish to drop:

> employees$hire_date <- NULL
To drop rows, you can specify...
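Both column and row removal can be sketched together on a small illustrative data frame:

```r
# Sketch of dropping an attribute and a row (illustrative data).
employees <- data.frame(emp_no = c(10001, 10002, 10003),
                        last_name = c("Facello", "Simmel", "Bamford"),
                        gender = c("M", "F", "M"))

employees$gender <- NULL         # drop a column by assigning NULL
employees <- employees[-2, ]     # drop row 2 with a negative row index

print(dim(employees))  # 2 2
```

Negative indices mean "everything except", so employees[-c(1, 3), ] would drop rows 1 and 3 in one step.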
Merging data enables us to understand how different data sources relate to each other. The merge operation in R is similar to the join operation in a database, which combines fields from two datasets using values that are common to each.
Refer to the Converting data types recipe and convert each attribute of the imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.
Perform the following steps to merge salaries and employees:
As employees and salaries share the common key emp_no, we can merge these two datasets using emp_no as the join key:

> employees_salary <- merge(employees, salaries, by="emp_no")
> head(employees_salary, 3)
  emp_no birth_date first_name last_name salary  from_date    to_date
1  10001 1953-09-02     Georgi   Facello  60117 1986-06-26 1987-06-26
2  10001 1953-09-02     Georgi   Facello  62102 1987-06-26 1988-06-25
3  10001...
The power of sorting enables us to view data in an arrangement that lets us analyze it more efficiently. In a database, we can use an order by clause to sort data by specified columns. In R, we can use the order and sort functions to arrange data.
Refer to the Converting data types recipe and convert each attribute of the imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.
Perform the following steps to sort the salaries dataset:
First, we can use the sort function to sort data:

> a <- c(5,1,4,3,2,6,3)
> sort(a)
[1] 1 2 3 3 4 5 6
> sort(a, decreasing=TRUE)
[1] 6 5 4 3 3 2 1
Next, we can determine how the order function works on the same input vector:

> order(a)
[1] 2 5 4 7 3 1 6
> order(a, decreasing = TRUE)
[1] 6 1 3 4 7 5 2
To sort a data frame by a specific column, we first obtain the ordered...
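Because sort returns the rearranged values while order returns the row indices that produce that arrangement, order is the one that applies to data frames. A minimal sketch, with illustrative salary data:

```r
# Sketch of sorting a data frame by a column via order().
salaries <- data.frame(emp_no = c(10002, 10001, 10003),
                       salary = c(65828, 60117, 40006))

# order() returns row indices; using them inside [ , ] reorders the rows
sorted <- salaries[order(salaries$salary, decreasing = TRUE), ]

print(sorted$emp_no)  # 10002 10001 10003
```

Passing multiple columns to order, such as order(salaries$emp_no, salaries$salary), mirrors a multi-column ORDER BY clause.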
Reshaping data is similar to creating a contingency table, which enables the user to aggregate data by specific values. The reshape2 package is designed for this specific purpose. Here, we introduce how to use the reshape2 package to transform our dataset from long to wide format with the dcast function. We also cover how to transform it from wide format back to long format with the melt function.
Refer to the Merging data recipe and merge employees and salaries into employees_salary.
Perform the following steps to reshape data:
First, we can use the dcast function to transform data from long to wide format:

> wide_salaries <- dcast(salaries, emp_no ~ year(ymd(from_date)), value.var="salary")
> wide_salaries[1:3, 1:7]
  emp_no  1985  1986  1987  1988  1989  1990
1  10001    NA 60117 62102 66074 66596 66961
2  10002    NA    NA    NA    NA    NA    NA
3  10003    NA    NA    NA    NA    NA    NA
We can also transform the data by keeping emp_no and the formatted...
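The round trip between the two shapes can be sketched on a tiny hypothetical table; this assumes the reshape2 package is installed, and the year columns and values are illustrative rather than taken from salaries.csv.

```r
# Sketch of long <-> wide reshaping with reshape2 (illustrative data;
# assumes reshape2 is installed).
library(reshape2)

wide <- data.frame(emp_no = c(10001, 10002),
                   `1986` = c(60117, NA),
                   `1987` = c(62102, 65828),
                   check.names = FALSE)

# melt: wide -> long; each emp_no/year pair becomes one row
long <- melt(wide, id.vars = "emp_no",
             variable.name = "year", value.name = "salary")

# dcast: long -> wide again, one column per year
wide2 <- dcast(long, emp_no ~ year, value.var = "salary")

print(dim(long))  # 4 3
```

The formula in dcast reads as rows ~ columns: variables on the left index the rows, and each distinct value of the right-hand variable becomes a column.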
There are numerous causes behind missing data. For example, it could be the result of typos or data-processing flaws. However, if there is missing data in our analysis process, the results of the analysis may be misleading. Thus, it is important to detect missing values before proceeding with further analysis.
Refer to the Converting data types recipe and convert each attribute of the imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.
Perform the following steps to detect missing values:
First, we select the records whose to_date attribute has a date later than 2100-01-01:

> salaries[salaries$to_date > "2100-01-01",]
We then change the data with a date later than 2100-01-01 to a missing value:

> salaries[salaries$to_date > "2100-01-01","to_date"] = NA
Next, we can use the is.na function to find which rows contain missing values:

> is.na(salaries...
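The detection step can be sketched end to end on an illustrative data frame; complete.cases, the row-wise complement of is.na, is often the more convenient tool for whole records.

```r
# Sketch of locating missing values with is.na() and complete.cases()
# (illustrative data).
salaries <- data.frame(emp_no = c(10001, 10001, 10002),
                       to_date = c("1987-06-26", NA, "1996-08-03"),
                       stringsAsFactors = FALSE)

na_mask <- is.na(salaries$to_date)                # per-value logical mask
missing_rows <- which(!complete.cases(salaries))  # rows with any NA

print(missing_rows)  # 2
```

sum(na_mask) or colSums(is.na(salaries)) then gives a quick count of missing values per column.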
The previous recipe showed us how to detect missing values within the dataset. Though data with missing values is incomplete, we can still adopt a heuristic approach to complete our dataset. Here, we introduce some techniques one can employ to impute missing values.
Refer to the Converting data types recipe and convert each attribute of the imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps in the Renaming the data variable recipe.
Perform the following steps to impute missing values:
First, we subset the user data with emp_no equal to 10001:

> test.emp <- salaries[salaries$emp_no == 10001,]
Then, we purposely set the salary of row 8 to a missing value:

> test.emp[8,c("salary")]
[1] 75286
> test.emp[8,c("salary")] = NA
For the first imputation method, we can remove records with missing values using the na.omit function:

> na.omit(test.emp)
On the other...
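A common alternative to dropping records is filling the gap with a summary statistic; the sketch below shows a simple mean imputation on an illustrative salary vector.

```r
# Sketch of mean imputation as an alternative to na.omit
# (illustrative values).
salary <- c(60117, 62102, NA, 66074)

# Replace each NA with the mean of the observed values
salary[is.na(salary)] <- mean(salary, na.rm = TRUE)

print(salary[3])  # 62764.33
```

Mean imputation preserves the sample mean but shrinks the variance, so it is a heuristic best reserved for a small share of missing values.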