You're reading from Learning R Programming

Product type: Book

Published in: Oct 2016

Reading level: Beginner

Publisher: Packt

ISBN-13: 9781785889776

Edition: 1st Edition

Tools: RStudio

Concepts: Programming Language

Author: Kun Ren

Chapter 12. Data Manipulation

In the previous chapter, you learned the methods used to access different types of databases such as relational databases (SQLite and MySQL) and non-relational databases (MongoDB and Redis). Relational databases usually return data in a tabular form, while non-relational databases may support nested data structures and other features.

Even after the data is loaded into memory, it is usually far from ready for analysis. Most data at this stage still needs cleaning and transforming, which may take a large proportion of the total time before any statistical model or visualization can be applied. In this chapter, you'll learn about a set of built-in functions and several packages for data manipulation. The packages are extremely powerful. However, to work well with these packages, we need a solid understanding of the knowledge introduced in the previous chapters.

In this chapter, we'll cover the following topics:

Using basic functions to manipulate data...

Using built-in functions to manipulate data frames

Previously, you learned the basics of data frames. Here, we will review the built-in functions used to filter a data frame. Although a data frame is essentially a list of vectors, we can access it like a matrix since all column vectors are of the same length. To select rows that meet certain conditions, we will supply a logical vector as the first argument of [], while the second is left empty.
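As a minimal sketch of this kind of row filtering, the following code uses a small made-up data frame. The name products and its columns are assumptions for illustration only, not the product data used later in the chapter:

```r
# Hypothetical data frame for illustration only
products <- data.frame(
  id = c("T01", "T02", "M01"),
  type = c("toy", "toy", "model"),
  price = c(10, 25, 40),
  stringsAsFactors = FALSE
)

# A logical vector goes in the first (row) position of [];
# the second (column) position is left empty to keep all columns
toys <- products[products$type == "toy", ]

# Conditions can be combined with & (and) and | (or)
cheap_toys <- products[products$type == "toy" & products$price < 20, ]
```

Note the trailing comma inside []. Without it, the logical vector would be used to select columns rather than rows.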

In R, these operations can be done with built-in functions. In this section, we will introduce some built-in functions that are most helpful for manipulating data into the form we need as model input or for presentation. Some of the functions and techniques were already presented in the previous chapters.

Most of the code in this section and subsequent sections is based on a set of fictitious data about some products. We will use the readr package to load the data for better handling of column types. If you don't have this package installed, run install...

Using SQL to query data frames via the sqldf package

In the previous chapter, you learned how to compose SQL statements to query data from relational databases such as SQLite and MySQL. Is there a way to directly use SQL to query data frames in R as if these data frames were tables in a relational database? The sqldf package says yes.

This package takes advantage of SQLite because it is lightweight and easy to embed into an R session. Run the following command to install this package if you don't have it:

install.packages("sqldf")

First, let's attach the package, as shown in the following code:

library(sqldf) 
## Loading required package: gsubfn 
## Loading required package: proto 
## Loading required package: RSQLite 
## Loading required package: DBI

Note that when we attach sqldf, a number of other packages are automatically loaded. The sqldf package depends on these packages, because what it does is basically transferring data and converting data...
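To give a rough idea of what querying a data frame with SQL looks like, here is a small sketch. It assumes sqldf is installed, and the products data frame is a made-up stand-in rather than the chapter's data:

```r
library(sqldf)

# Hypothetical data frame for illustration only
products <- data.frame(
  id = c("T01", "T02", "M01"),
  type = c("toy", "toy", "model"),
  price = c(10, 25, 40),
  stringsAsFactors = FALSE
)

# The data frame's variable name is used as the table name in the query
toys <- sqldf("select id, price from products where type = 'toy'")

# SQL aggregation over the same data frame
type_summary <- sqldf(
  "select type, count(*) as n, avg(price) as avg_price
   from products
   group by type
   order by type"
)
```

Both calls return ordinary data frames, so the results can be passed on to any other R function.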

Using data.table to manipulate data

In the first section, we reviewed some built-in functions used to manipulate data frames. Then, we introduced sqldf, which makes simple data queries and summaries easier. However, both approaches have their limitations. Built-in functions can be verbose and slow, while sqldf makes it hard to summarize data in more complex ways because SQL is not as powerful as the full spectrum of R functions.

The data.table package provides a powerful enhanced version of data.frame. It is very fast and can handle large data sets that fit into memory. It introduces a natural syntax for data manipulation using []. Run the following command to install the package from CRAN if you don't have it yet:

install.packages("data.table")

Once the package is successfully installed, we will load the package and see what it offers:

library(data.table) 
##  
## Attaching package: 'data.table' 
## The following objects are masked from 'package:reshape2': 
##  
##     dcast...
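As a brief sketch of the [] syntax, the following code filters, aggregates, and adds a column on a small made-up table. It assumes data.table is installed, and the object dt and its columns are hypothetical:

```r
library(data.table)

# Hypothetical table for illustration only
dt <- data.table(
  id = c("T01", "T02", "M01"),
  type = c("toy", "toy", "model"),
  price = c(10, 25, 40)
)

# First argument: filter rows; column names are used directly
toys <- dt[type == "toy"]

# Second argument with by: compute a summary for each group
avg_price <- dt[, .(avg_price = mean(price)), by = type]

# := adds a column by reference, modifying dt in place
dt[, discounted := price * 0.9]
```

Unlike a data frame, the expressions inside [] are evaluated within the table itself, which is why type and price need no dt$ prefix.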

Using dplyr pipelines to manipulate data frames

Another popular package is dplyr, which defines a grammar of data manipulation. Instead of using the subset operator ([]), dplyr defines a set of basic verb functions as the building blocks of data operations and imports a pipeline operator to chain these functions to perform complex multistep tasks.

Run the following code to install dplyr from CRAN if you don't have it yet:

install.packages("dplyr")

First, we will reload the product tables to reset all data to its original form:

library(readr) 
product_info <- read_csv("data/product-info.csv") 
product_stats <- read_csv("data/product-stats.csv") 
product_tests <- read_csv("data/product-tests.csv") 
toy_tests <- read_csv("data/product-toy-tests.csv")

Then, we will load the dplyr package:

library(dplyr) 
##  
## Attaching package: 'dplyr' 
## The following objects are masked from 'package:data.table': 
##  
##     between...
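To illustrate how verb functions are chained with the pipeline operator, here is a small sketch. It assumes dplyr is installed and uses a made-up products data frame rather than the tables loaded above:

```r
library(dplyr)

# Hypothetical data frame for illustration only
products <- data.frame(
  id = c("T01", "T02", "M01"),
  type = c("toy", "toy", "model"),
  price = c(10, 25, 40),
  stringsAsFactors = FALSE
)

# Each verb takes a data frame and returns a new one,
# so the steps read from top to bottom
summary_tbl <- products %>%
  filter(price >= 20) %>%
  group_by(type) %>%
  summarize(n = n(), avg_price = mean(price)) %>%
  arrange(desc(avg_price))
```

The same result could be written as nested function calls, but the pipeline keeps the order of operations visible.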

Using rlist to work with nested data structures

In the previous chapter, you learned about both relational databases that store data in tables and non-relational databases that support nested data structures. In R, the most commonly used nested data structure is a list object. All previous sections focus on manipulating tabular data. In this section, let's play with the rlist package I developed, which is designed for manipulating non-tabular data.

The design of rlist is very similar to that of dplyr. It provides mapping, filtering, selecting, sorting, grouping, and aggregating functionality for list objects. Run the following code to install the rlist package from CRAN:

install.packages("rlist")
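Once the package is installed, filtering and mapping over a list can be sketched as follows. The list below is built by hand for illustration. Its second element, including the name ExampleModel, is hypothetical, and only a few fields are included:

```r
library(rlist)

# Hypothetical nested list for illustration only
products <- list(
  list(id = "T01", name = "SupCar", type = "toy", released = TRUE),
  list(id = "M01", name = "ExampleModel", type = "model", released = FALSE)
)

# Keep only the elements whose type field equals "toy"
toys <- list.filter(products, type == "toy")

# Map each element to its name and collect the results in a vector
product_names <- list.mapv(products, name)

# Filtering and mapping can be combined
released_ids <- list.mapv(list.filter(products, released), id)
```

As with dplyr, field names such as type and name are evaluated within each list element, so no explicit indexing is needed.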

We have the non-tabular version of the product data stored in data/products.json. In this file, each product has a JSON representation as follows:

{ 
    "id": "T01", 
    "name": "SupCar", 
    "type": "toy", 
    "class": "vehicle", 
    "released": true, 
    "stats"...

Summary

In this chapter, you learned a number of basic functions and various packages for data manipulation. Using built-in functions to manipulate data can be verbose. Several packages are tailored for filtering and aggregating data based on different techniques and philosophies. The sqldf package uses an embedded SQLite database so that we can directly write SQL statements to query data frames in our working environment. On the other hand, data.table provides an enhanced version of data.frame and a powerful syntax, and dplyr defines a grammar of data manipulation by providing a set of pipeline-friendly verb functions. The rlist package provides a set of pipeline-friendly functions for non-tabular data manipulation. No single package is best for all situations. Each of them represents a way of thinking, and which one best fits a certain problem depends on how you understand the problem and your experience of working with data.

Processing data and doing simulation require considerable computing...


Author (1)

Kun Ren

Kun Ren has used R for nearly 4 years in quantitative trading, along with C++ and C#, and he has worked very intensively (more than 8-10 hours every day) on useful R packages that the community does not offer yet. He contributes to packages developed by other authors and reports issues to make things work better. He is also a frequent speaker at R conferences in China and has given multiple talks. Kun also has a great social media presence. Additionally, he has substantially contributed to various projects, which is evident from his GitHub account: https://github.com/renkun-ken https://cn.linkedin.com/in/kun-ren-76027530 http://renkun.me/ http://renkun.me/formattable/ http://renkun.me/pipeR/ http://renkun.me/rlist/