Mastering Data analysis with R

Chapter 1. Hello, Data!

Most projects in R start with loading at least some data into the running R session. As R supports a variety of file formats and database backend, there are several ways to do so. In this chapter, we will not deal with basic data structures, which are already familiar to you, but will concentrate on the performance issue of loading larger datasets and dealing with special file formats.

Note

For a quick overview on the standard tools and to refresh your knowledge on importing general data, please see Chapter 7 of the official An Introduction to R manual of CRAN at http://cran.r-project.org/doc/manuals/R-intro.html#Reading-data-from-files or Rob Kabacoff's Quick-R site, which offers keywords and cheat-sheets for most general tasks in R at http://www.statmethods.net/input/importingdata.html. For further materials, please see the References section in the Appendix.

Although R has its own (serialized) binary RData and rds file formats, which are extremely convenient to use for all R users as these also store R object meta-information in an efficient way, most of the time we have to deal with other input formats—provided by our employer or client.

One of the most popular data file formats is flat files, which are simple text files in which the values are separated by white-space, the pipe character, commas, or more often by semi-colon in Europe. This chapter will discuss several options R has to offer to load these kinds of documents, and we will benchmark which of these is the most efficient approach to import larger files.

Sometimes we are only interested in a subset of a dataset; thus, there is no need to load all the data from the sources. In such cases, database backend can provide the best performance, where the data is stored in a structured way preloaded on our system, so we can query any subset of that with simple and efficient commands. The second section of this chapter will focus on the three most popular databases (MySQL, PostgreSQL, and Oracle Database), and how to interact with those in R.

Besides some other helper tools and a quick overview on other database backend, we will also discuss how to load Excel spreadsheets into R—without the need to previously convert those to text files in Excel or Open/LibreOffice.

Of course this chapter is not just about data file formats, database connections, and such boring internals. But please bear in mind that data analytics always starts with loading data. This is unavoidable, so that our computer and statistical environment know the structure of the data before doing some real analytics.

Loading text files of a reasonable size

The title of this chapter might also be Hello, Big Data!, as now we concentrate on loading relatively large amount of data in an R session. But what is Big Data, and what amount of data is problematic to handle in R? What is reasonable size?

R was designed to process data that fits in the physical memory of a single computer. So handling datasets that are smaller than the actual accessible RAM should be fine. But please note that the memory required to process data might become larger while doing some computations, such as principal component analysis, which should be also taken into account. I will refer to this amount of data as reasonable sized datasets.

Loading data from text files is pretty simple with R, and loading any reasonable sized dataset can be achieved by calling the good old read.table function. The only issue here might be the performance: how long does it take to read, for example, a quarter of a million rows of data? Let's see:

> library('hflights')
> write.csv(hflights, 'hflights.csv', row.names = FALSE)

Note

As a reminder, please note that all R commands and the returned output are formatted as earlier in this book. The commands starts with > on the first line, and the remainder of multi-line expressions starts with +, just as in the R console. To copy and paste these commands on your machine, please download the code examples from the Packt homepage. For more details, please see the What you need for this book section in the Preface.

Yes, we have just written an 18.5 MB text file to your disk from the hflights package, which includes some data on all flights departing from Houston in 2011:

> str(hflights)
'data.frame':  227496 obs. of  21 variables:
 $ Year             : int  2011 2011 2011 2011 2011 2011 2011 ...
 $ Month            : int  1 1 1 1 1 1 1 1 1 1 ...
 $ DayofMonth       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ DayOfWeek        : int  6 7 1 2 3 4 5 6 7 1 ...
 $ DepTime          : int  1400 1401 1352 1403 1405 1359 1359 ...
 $ ArrTime          : int  1500 1501 1502 1513 1507 1503 1509 ...
 $ UniqueCarrier    : chr  "AA" "AA" "AA" "AA" ...
 $ FlightNum        : int  428 428 428 428 428 428 428 428 428 ...
 $ TailNum          : chr  "N576AA" "N557AA" "N541AA" "N403AA" ...
 $ ActualElapsedTime: int  60 60 70 70 62 64 70 59 71 70 ...
 $ AirTime          : int  40 45 48 39 44 45 43 40 41 45 ...
 $ ArrDelay         : int  -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
 $ DepDelay         : int  0 1 -8 3 5 -1 -1 -5 43 43 ...
 $ Origin           : chr  "IAH" "IAH" "IAH" "IAH" ...
 $ Dest             : chr  "DFW" "DFW" "DFW" "DFW" ...
 $ Distance         : int  224 224 224 224 224 224 224 224 224 ...
 $ TaxiIn           : int  7 6 5 9 9 6 12 7 8 6 ...
 $ TaxiOut          : int  13 9 17 22 9 13 15 12 22 19 ...
 $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ CancellationCode : chr  "" "" "" "" ...
 $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...

Note

The hflights package provides an easy way to load a subset of the huge Airline Dataset of the Research and Innovation Technology Administration at the Bureau of Transportation Statistics. The original database includes the scheduled and actual departure/arrival times of all US flights along with some other interesting information since 1987, and is often used to demonstrate machine learning and Big Data technologies. For more details on the dataset, please see the column description and other meta-data at http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0.

We will use this 21-column data to benchmark data import times. For example, let's see how long it takes to import the CSV file with read.csv:

> system.time(read.csv('hflights.csv'))
   user  system elapsed 
  1.730   0.007   1.738

It took a bit more than one and a half seconds to load the data from an SSD here. It's quite okay, but we can achieve far better results by identifying then specifying the classes of the columns instead of calling the default type.convert (see the docs in read.table for more details or search on StackOverflow, where the performance of read.csv seems to be a rather frequent and popular question):

> colClasses <- sapply(hflights, class)
> system.time(read.csv('hflights.csv', colClasses = colClasses))
   user  system elapsed 
  1.093   0.000   1.092

It's much better! But should we trust this one observation? On our way to mastering data analysis in R, we should implement some more reliable tests—by simply replicating the task n times and providing a summary on the results of the simulation. This approach provides us with performance data with multiple observations, which can be used to identify statistically significant differences in the results. The microbenchmark package provides a nice framework for such tasks:

> library(microbenchmark)
> f <- function() read.csv('hflights.csv')
> g <- function() read.csv('hflights.csv', colClasses = colClasses,
+                        nrows = 227496, comment.char = '')
> res <- microbenchmark(f(), g())
> res
Unit: milliseconds
 expr       min        lq   median       uq      max neval
  f() 1552.3383 1617.8611 1646.524 1708.393 2185.565   100
  g()  928.2675  957.3842  989.467 1044.571 1284.351   100

So we defined two functions: f stands for the default settings of read.csv while, in the g function, we passed the aforementioned column classes along with two other parameters for increased performance. The comment.char argument tells R not to look for comments in the imported data file, while the nrows parameter defined the exact number of rows to read from the file, which saves some time and space on memory allocation. Setting stringsAsFactors to FALSE might also speed up importing a bit.

Note

Identifying the number of lines in the text file could be done with some third-party tools, such as wc on Unix, or a slightly slower alternative would be the countLines function from the R.utils package.

But back to the results. Let's also visualize the median and related descriptive statistics of the test cases, which was run 100 times by default:

> boxplot(res, xlab  = '',
+   main = expression(paste('Benchmarking ', italic('read.table'))))

The difference seems to be significant (please feel free to do some statistical tests to verify that), so we made a 50+ percent performance boost simply by fine-tuning the parameters of read.table.

Data files larger than the physical memory

Loading a larger amount of data into R from CSV files that would not fit in the memory could be done with custom packages created for such cases. For example, both the sqldf package and the ff package have their own solutions to load data from chunk to chunk in a custom data format. The first uses SQLite or another SQL-like database backend, while the latter creates a custom data frame with the ffdf class that can be stored on disk. The bigmemory package provides a similar approach. Usage examples (to be benchmarked) later:

> library(sqldf)
> system.time(read.csv.sql('hflights.csv'))
   user  system elapsed 
  2.293   0.090   2.384 
> library(ff)
> system.time(read.csv.ffdf(file = 'hflights.csv'))
   user  system elapsed 
  1.854   0.073   1.918 
> library(bigmemory)
> system.time(read.big.matrix('hflights.csv', header = TRUE))
   user  system elapsed 
  1.547   0.010   1.559

Please note that the header defaults to FALSE with read.big.matrix from the bigmemory package, so be sure to read the manual of the referenced functions before doing your own benchmarks. Some of these functions also support performance tuning like read.table. For further examples and use cases, please see the Large memory and out-of-memory data section of the High-Performance and Parallel Computing with R CRAN Task View at http://cran.r-project.org/web/views/HighPerformanceComputing.html.

Benchmarking text file parsers

Another notable alternative for handling and loading reasonable sized data from flat files to R is the data.table package. Although it has a unique syntax differing from the traditional S-based R markup, the package comes with great documentation, vignettes, and case studies on the indeed impressive speedup it can offer for various database actions. Such uses cases and examples will be discussed in the Chapter 3, Filtering and Summarizing Data and Chapter 4, Restructuring Data.

The package ships a custom R function to read text files with improved performance:

> library(data.table)
> system.time(dt <- fread('hflights.csv'))
   user  system elapsed 
  0.153   0.003   0.158

Loading the data was extremely quick compared to the preceding examples, although it resulted in an R object with a custom data.table class, which can be easily transformed to the traditional data.frame if needed:

> df <- as.data.frame(dt)

Or by using the setDF function, which provides a very fast and in-place method of object conversion without actually copying the data in the memory. Similarly, please note:

> is.data.frame(dt)
[1] TRUE

This means that a data.table object can fall back to act as a data.frame for traditional usage. Leaving the imported data as is or transforming it to data.frame depends on the latter usage. Aggregating, merging, and restructuring data with the first is faster compared to the standard data frame format in R. On the other hand, the user has to learn the custom syntax of data.table—for example, DT[i, j, by] stands for "from DT subset by i, then do j grouped by by". We will discuss it later in the Chapter 3, Filtering and Summarizing Data.

Now, let's compare all the aforementioned data import methods: how fast are they? The final winner seems to be fread from data.table anyway. First, we define some methods to be benchmarked by declaring the test functions:

> .read.csv.orig   <- function() read.csv('hflights.csv')
> .read.csv.opt    <- function() read.csv('hflights.csv',
+     colClasses = colClasses, nrows = 227496, comment.char = '',
+     stringsAsFactors = FALSE)
> .read.csv.sql    <- function() read.csv.sql('hflights.csv')
> .read.csv.ffdf   <- function() read.csv.ffdf(file = 'hflights.csv')
> .read.big.matrix <- function() read.big.matrix('hflights.csv',
+     header = TRUE)
> .fread           <- function() fread('hflights.csv')

Now, let's run all these functions 10 times each instead of several hundreds of iterations like previously—simply to save some time:

> res <- microbenchmark(.read.csv.orig(), .read.csv.opt(),
+   .read.csv.sql(), .read.csv.ffdf(), .read.big.matrix(), .fread(),
+   times = 10)

And print the results of the benchmark with a predefined number of digits:

> print(res, digits = 6)
Unit: milliseconds
               expr      min      lq   median       uq      max neval
   .read.csv.orig() 2109.643 2149.32 2186.433 2241.054 2421.392    10
    .read.csv.opt() 1525.997 1565.23 1618.294 1660.432 1703.049    10
    .read.csv.sql() 2234.375 2265.25 2283.736 2365.420 2599.062    10
   .read.csv.ffdf() 1878.964 1901.63 1947.959 2015.794 2078.970    10
 .read.big.matrix() 1579.845 1603.33 1647.621 1690.067 1937.661    10
           .fread()  153.289  154.84  164.994  197.034  207.279    10

Please note that now we were dealing with datasets fitting in actual physical memory, and some of the benchmarked packages are designed and optimized for far larger databases. So it seems that optimizing the read.table function gives a great performance boost over the default settings, although if we are after really fast importing of reasonable sized data, using the data.table package is the optimal solution.

Loading a subset of text files

Sometimes we only need some parts of the dataset for an analysis, stored in a database backend or in flat files. In such situations, loading only the relevant subset of the data frame will result in much more speed improvement compared to any performance tweaks and custom packages discussed earlier.

Let's imagine we are only interested in flights to Nashville, where the annual useR! conference took place in 2012. This means we need only those rows of the CSV file where the Dest equals BNA (this International Air Transport Association airport code stands for Nashville International Airport).

Instead of loading the whole dataset in 160 to 2,000 milliseconds (see the previous section) and then dropping the unrelated rows (see in Chapter 3, Filtering and Summarizing Data), let's see the possible ways of filtering the data while loading it.

The already mentioned sqldf package can help with this task by specifying a SQL statement to be run on the temporary SQLite database created for the importing task:

> df <- read.csv.sql('hflights.csv',
+   sql = "select * from file where Dest = '\"BNA\"'")

This sql argument defaults to "select * from file", which means loading all fields of each row without any filters. Now we extended that with a filter statement. Please note that in our updated SQL statements, we also added the double quotes to the search term, as sqldf does not automatically recognize the quotes as special; it regards them as part of the fields. One may overcome this issue also by providing a custom filter argument, such as the following example on Unix-like systems:

> df <- read.csv.sql('hflights.csv',
+   sql = "select * from file where Dest = 'BNA'",
+   filter = 'tr -d ^\\" ')

The resulting data frame holds only 3,481 observations out of the 227,496 cases in the original dataset, and filtering inside the temporary SQLite database of course speeds up data importing a bit:

> system.time(read.csv.sql('hflights.csv'))
   user  system elapsed 
  2.117   0.070   2.191 
> system.time(read.csv.sql('hflights.csv',
+   sql = "select * from file where Dest = '\"BNA\"'"))
   user  system elapsed 
  1.700   0.043   1.745

The slight improvement is due to the fact that both R commands first loaded the CSV file to a temporary SQLite database; this process of course takes some time and cannot be eliminated from this process. To speed up this part of the evaluation, you can specify dbname as NULL for a performance boost. This way, the SQLite database would be created in memory instead of a tempfile, which might not be an optimal solution for larger datasets.

Filtering flat files before loading to R

Is there a faster or smarter way to load only a portion of such a text file? One might apply some regular expression-based filtering on the flat files before passing them to R. For example, grep or ack might be a great tool to do so in a Unix environment, but it's not available by default on Windows machines, and parsing CSV files by regular expressions might result in some unexpected side-effects as well. Believe me, you never want to write a CSV, JSON, or XML parser from scratch!

Anyway, a data scientist nowadays should be a real jack-of-all-trades when it comes to processing data, so here comes a quick and dirty example to show how one could read the filtered data in less than 100 milliseconds:

> system.time(system('cat hflights.csv | grep BNA', intern = TRUE))
   user  system elapsed 
  0.040   0.050   0.082

Well, that's a really great running time compared to any of our previous results! But what if we want to filter for flights with an arrival delay of more than 13.5 minutes?

Another way, and probably a more maintainable approach, would be to first load the data into a database backend, and query that when any subset of the data is needed. This way we could for example, simply populate a SQLite database in a file only once, and then later we could fetch any subsets in a fragment of read.csv.sql's default run time.

So let's create a persistent SQLite database:

> sqldf("attach 'hflights_db' as new")

This command has just created a file named to hflights_db in the current working directory. Next, let's create a table named hflights and populate the content of the CSV file to the database created earlier:

> read.csv.sql('hflights.csv',
+   sql = 'create table hflights as select * from file',
+   dbname = 'hflights_db')

No benchmarking was made so far, as these steps will be run only once, while the queries for sub-parts of the dataset will probably run multiple times later:

> system.time(df <- sqldf(
+   sql = "select * from hflights where Dest = '\"BNA\"'",
+   dbname = "hflights_db"))
   user  system elapsed 
  0.070   0.027   0.097

And we have just loaded the required subset of the database in less than 100 milliseconds! But we can do a lot better if we plan to often query the persistent database: why not dedicate a real database instance for our dataset instead of a simple file-based and server-less SQLite backend?

Loading data from databases

The great advantage of using a dedicated database backend instead of loading data from the disk on demand is that databases provide:

Faster access to the whole or selected parts of large tables
Powerful and quick ways to aggregate and filter data before loading it to R
Infrastructure to store data in a relational, more structured scheme compared to the traditional matrix model of spreadsheets and R objects
Procedures to join and merge related data
Concurrent and network access from multiple clients at the same time
Security policies and limits to access the data
A scalable and configurable backend to store data

The DBI package provides a database interface, a communication channel between R and various relational database management systems (RDBMS), such as MySQL, PostgreSQL, MonetDB, Oracle, and for example Open Document Databases, and so on. There is no real need to install the package on its own because, acting as an interface, it will be installed anyway as a dependency, if needed.

Connecting to a database and fetching data is pretty similar with all these backends, as all are based on the relational model and using SQL to manage and query data. Please be advised that there are some important differences between the aforementioned database engines and that several more open-source and commercial alternatives also exist. But we will not dig into the details on how to choose a database backend or how to build a data warehouse and extract, transform, and load (ETL) workflows, but we will only concentrate on making connections and managing data from R.

Note

SQL, originally developed at IBM, with its more than 40 years of history, is one of the most important programming languages nowadays—with various dialects and implementations. Being one of the most popular declarative languages all over the world, there are many online tutorials and free courses to learn how to query and manage data with SQL, which is definitely one of the most important tools in every data scientist's Swiss army knife.

So, besides R, it's really worth knowing your way around RDBMS, which are extremely common in any industry you may be working at as a data analyst or in a similar position.

Setting up the test environment

Database backends usually run on servers remote from the users doing data analysis, but for testing purposes, it might be a good idea to install local instances on the machine running R. As the installation process can be extremely different on various operating systems, we will not enter into any details of the installation steps, but we will rather refer to where the software can be downloaded from and some further links to great resources and documentation for installation.

Please note that installing and actually trying to load data from these databases is totally optional and you do not have to follow each step—the rest of the book will not depend on any database knowledge or prior experience with databases. On the other hand, if you do not want to mess your workspace with temporary installation of multiple database applications for testing purposes, using virtual machines might be an optimal workaround. Oracle's VirtualBox provides a free and easy way of running multiple virtual machines with their dedicated operating system and userspace.

Note

For detailed instructions on how to download then import a VirtualBox image, see the Oracle section.

This way you can quickly deploy a fully functional, but disposable, database environment to test-drive the following examples of this chapter. In the following image, you can see VirtualBox with four installed virtual machines, of which three are running in the background to provide some database backends for testing purposes:

Note

VirtualBox can be installed by your operating system's package manager on Linux or by downloading the installation binary/sources from https://www.virtualbox.org/wiki/Downloads. For detailed and operating-system specific installation information, please refer to the Chapter 2, Installation details of the manual: http://www.virtualbox.org/manual/.

Nowadays, setting up and running a virtual machine is really intuitive and easy; basically you only need a virtual machine image to be loaded and launched. Some virtual machines, so called appliances, include the operating system, with a number of further software usually already configured to work, for simple, easy and quick distribution.

Tip

Once again, if you do not enjoy installing and testing new software or spending time on learning about the infrastructure empowering your data needs, the following steps are not necessary and you can freely skip these optional tasks primarily described for full-stack developers/data scientists.

Such pre-configured virtual machines to be run on any computer can be downloaded from various providers on the Internet in multiple file formats, such as OVF or OVA. General purpose VirtualBox virtual appliances can be downloaded for example from http://virtualboximages.com/vdi/index or http://virtualboxes.org/images/.

Note

Virtual appliances should be imported in VirtualBox, while non-OVF/OVA disk images should be attached to newly created virtual machines; thus, some extra manual configuration might also be needed.

Oracle also has a repository with a bunch of useful virtual images for data scientist apprentices and other developers at http://www.oracle.com/technetwork/community/developer-vm/index.html, with for example the Oracle Big Data Lite VM developer virtual appliance featuring the following most important components:

Oracle Database
Apache Hadoop and various tools in Cloudera distribution
The Oracle R Distribution
Build on Oracle Enterprise Linux

Disclaimer: Oracle wouldn't be my first choice personally, but they did a great job with their platform-independent virtualization environment, just like with providing free developer VMs based on their commercial products. In short, it's definitely worth using the provided Oracle tools.

Note

If you cannot reach your installed virtual machines on the network, please update your network settings to use Host-only adapter if no Internet connection is needed, or Bridged networking for a more robust setup. The latter setting will reserve an extra IP on your local network for the virtual machine; this way, it becomes accessible easily. Please find more details and examples with screenshots in the Oracle database section.

Another good source of virtual appliances created for open-source database engines is the Turnkey GNU/Linux repository at http://www.turnkeylinux.org/database. These images are based on Debian Linux, are totally free to use, and currently support the MySQL, PostgreSQL, MongoDB, and CouchDB databases.

A great advantage of the Turnkey Linux media is that it includes only open-source, free software and non-proprietary stuff. Besides, the disk images are a lot smaller and include only the required components for one dedicated database engine. This also results in far faster installation with less overhead in terms of the required disk and memory space.

Further similar virtual appliances are available at http://www.webuzo.com/sysapps/databases with a wider range of database backends, such as Cassandra, HBase, Neo4j, Hypertable, or Redis, although some of the Webuzo appliances might require a paid subscription for deployment.

And as the new cool being Docker, I even more suggest you to get familiar with its concept on deploying software containers incredibly fast. Such container can be described as a standalone filesystem including the operating system, libraries, tools, data and so is based on abstraction layers of Docker images. In practice this means that you can fire up a database including some demo data with a one-liner command on your localhost, and developing such custom images is similarly easy. Please see some simple examples and further references at my R and Pandoc-related Docker images described at https://github.com/cardcorp/card-rocker.

MySQL and MariaDB

MySQL is the most popular open-source database engine all over the world based on the number of mentions, job offers, Google searches, and so on, summarized by the DB-Engines Ranking: http://db-engines.com/en/ranking. Mostly used in Web development, the high popularity is probably due to the fact that MySQL is free, platform-independent, and relatively easy to set up and configure—just like its drop-in replacement fork called MariaDB.

Note

MariaDB is a community-developed, fully open-source fork of MySQL, started and led by the founder of MySQL, Michael Widenius. It was later merged with SkySQL; thus further ex-MySQL executives and investors joined the fork. MariaDB was created after Sun Microsystems bought MySQL, currently owned by Oracle, and the development of the database engine changed.

We will refer to both engines as MySQL in the book to keep it simple, as MariaDB can be considered as a drop-in replacement for MySQL, so please feel free to reproduce the following examples with either MySQL or MariaDB.

Although the installation of a MySQL server is pretty straightforward on most operating systems (https://dev.mysql.com/downloads/mysql/), one might rather prefer to have the database installed in a virtual machine. Turnkey Linux provides small but fully configured, virtual appliances for free: http://www.turnkeylinux.org/mysql.

R provides multiple ways to query data from a MySQL database. One option is to use the RMySQL package, which might be a bit tricky for some users to install. If you are on Linux, please be sure to install the development packages of MySQL along with the MySQL client, so that the package can compile on your system. And, as there are no binary packages available on CRAN for Windows installation due to the high variability of MySQL versions, Windows users should also compile the package from source:

> install.packages('RMySQL', type = 'source')

Windows users might find the following blog post useful about the detailed installation steps: http://www.ahschulz.de/2013/07/23/installing-rmysql-under-windows/.

Note

For the sake of simplicity, we will refer to the MySQL server as localhost listening on the default 3306 port; user will stand as user and password as password in all database connections. We will work with the hflights table in the hflights_db database, just like in the SQLite examples a few pages earlier. If you are working in a remote or virtual server, please modify the host, username, and so on arguments of the following code examples accordingly.

After successfully installing and starting the MySQL server, we have to set up a test database, which we could later populate in R. To this end, let us start the MySQL command-line tool to create the database and a test user.

Please note that the following example was run on Linux, and a Windows user might have to also provide the path and probably the exe file extension to start the MySQL command-line tool:

This quick session can be seen in the previous screenshot, where we first connected to the MySQL server in the command-line as the root (admin) user. Then we created a database named hflights_db, and granted all privileges and permissions of that database to a new user called user with the password set to password. Then we simply verified whether we could connect to the database with the newly created user, and we exited the command-line MySQL client.

To load data from a MySQL database into R, first we have to connect and also often authenticate with the server. This can be done with the automatically loaded DBI package when attaching RMySQL:

> library(RMySQL)
Loading required package: DBI
> con <- dbConnect(dbDriver('MySQL'),
+   user = 'user', password = 'password', dbname = 'hflights_db')

Now we can refer to our MySQL connection as con, where we want to deploy the hflights dataset for later access:

> dbWriteTable(con, name = 'hflights', value = hflights)
[1] TRUE
> dbListTables(con)
[1] "hflights"

The dbWriteTable function wrote the hflights data frame with the same name to the previously defined connection. The latter command shows all the tables in the currently used databases, equivalent to the SHOW TABLES SQL command. Now that we have our original CVS file imported to MySQL, let's see how long it takes to read the whole dataset:

> system.time(dbReadTable(con, 'hflights'))
   user  system elapsed 
  0.993   0.000   1.058

Or we can do so with a direct SQL command passed to dbGetQuery from the same DBI package:

> system.time(dbGetQuery(con, 'select * from hflights'))
   user  system elapsed 
  0.910   0.000   1.158

And, just to keep further examples simpler, let's get back to the sqldf package, which stands for "SQL select on data frames". As a matter of fact, sqldf is a convenient wrapper around DBI's dbSendQuery function with some useful defaults, and returns data.frame. This wrapper can query various database engines, such as SQLite, MySQL, H2, or PostgreSQL, and defaults to the one specified in the global sqldf.driver option; or, if that's NULL, it will then check if any R packages have been loaded for the aforementioned backends.

As we have already loaded RMySQL, now sqldf will default to using MySQL instead of SQLite. But we still have to specify which connection to use; otherwise the function will try to open a new one—without any idea about our complex username and password combination, not to mention the mysterious database name. The connection can be passed in each sqldf expression or defined once in a global option:

> options('sqldf.connection' = con)
> system.time(sqldf('select * from hflights'))
   user  system elapsed 
  0.807   0.000   1.014

The difference in the preceding three versions of the same task does not seem to be significant. That 1-second timing seems to be a pretty okay result compared to our previously tested methods—although loading the whole dataset with data.table still beats this result. What about if we only need a subset of the dataset? Let's fetch only those flights ending in Nashville, just like in our previous SQLite example:

> system.time(sqldf('SELECT * FROM hflights WHERE Dest = "BNA"'))
   user  system elapsed 
  0.000   0.000   0.281

This does not seem to be very convincing compared to our previous SQLite test, as the latter could reproduce the same result in less than 100 milliseconds. But please also note that that both the user and system elapsed times are zero, which was not the case with SQLite.

Note

The returned elapsed time by system.time means the number of milliseconds passed since the start of the evaluation. The user and system times are a bit trickier to understand; they are reported by the operating system. More or less, user means the CPU time spent by the called process (like R or the MySQL server), while system reports the CPU time required by the kernel and other operating system processes (such as opening a file for reading). See ?proc.time for further details.

This means that no CPU time was used at all to return the required subset of data, which took almost 100 milliseconds with SQLite. How is it possible? What if we index the database on Dest?

> dbSendQuery(con, 'CREATE INDEX Dest_idx ON hflights (Dest(3));')

This SQL query stands for creating an index named Dest_idx in our table based on the Dest column's first three letters.

Note

SQL index can seriously boost the performance of a SELECT statement with WHERE clauses, as MySQL this way does not have to read through the entire database to match each row, but it can determine the position of the relevant search results. This performance boost becomes more and more spectacular with larger databases, although it's also worth mentioning that indexing only makes sense if subsets of data are queried most of the time. If most or all data is needed, sequential reads would be faster.

Live example:

> system.time(sqldf('SELECT * FROM hflights WHERE Dest = "BNA"'))
   user  system elapsed 
  0.024   0.000   0.034

It seems to be a lot better! Well, of course, we could have also indexed the SQLite database, not just the MySQL instance. To test it again, we have to revert the default sqldf driver to SQLite, which was overridden by loading the RMySQL package:

> options(sqldf.driver = 'SQLite')
> sqldf("CREATE INDEX Dest_idx ON hflights(Dest);",
+   dbname = "hflights_db"))
NULL
> system.time(sqldf("select * from hflights where
+   Dest = '\"BNA\"'", dbname = "hflights_db"))
   user  system elapsed 
  0.034   0.004   0.036

So it seems that both database engines are capable of returning the required subset of data in a fraction of a second, which is a lot better even compared to what we achieved with the impressive data.table before.

Although SQLite proved to be faster than MySQL in some earlier examples, there are many reasons to choose the latter in most situations. First, SQLite is a file-based database, which simply means that the database should be on a filesystem attached to the computer running R. This usually means having the SQLite database and the running R session on the same computer. Similarly, MySQL can handle larger amount of data; it has user management and rule-based control on what they can do, and concurrent access to the same dataset. The smart data scientist knows how to choose his weapon—depending on the task, another database backend might be the optimal solution. Let's see what other options we have in R!

PostgreSQL

While MySQL is said to be the most popular open-source relational database management system, PostgreSQL is famous for being "the world's most advanced open source database". This means that PostgreSQL is often considered to have more features compared to the simpler but faster MySQL, including analytic functions, which has led to PostgreSQL often being described as the open-source version of Oracle.

This sounds rather funny now, as Oracle owns MySQL today. So a bunch of things have changed in the past 20-30 years of RDBMS history, and PostgreSQL is not so slow any more. On the other hand, MySQL has also gained some nice new features—for example MySQL also became ACID-compliant with the InnoDB engine, allowing rollback to previous states of the database. There are some other differences between the two popular database servers that might support choosing either of them. Now let's see what happens if our data provider has a liking for PostgreSQL instead of MySQL!

Installing PostgreSQL is similar to MySQL. One may install the software with the operating system's package manager, download a graphical installer from http://www.enterprisedb.com/products-services-training/pgdownload, or run a virtual appliance with, for example, the free Turnkey Linux, which provides a small but fully configured disk image for free at http://www.turnkeylinux.org/postgresql.

Tip

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

After successfully installing and starting the server, let's set up the test database—just like we did after the MySQL installation:

The syntax is a bit different in some cases, and we have used some command-line tools for the user and database creation. These helper programs are shipped with PostgreSQL by default, and MySQL also have some similar functionality with mysqladmin.

After setting up the initial test environment, or if we already have a working database instance to connect, we can repeat the previously described data management tasks with the help of the RPostgreSQL package:

> library(RPostgreSQL)
Loading required package: DBI

Note

If your R session starts to throw strange error messages in the following examples, it's highly possible that the loaded R packages are conflicting. You could simply start a clean R session, or detach the previously attached packages—for example, detach('package:RMySQL', unload = TRUE).

Connecting to the database (listening on the default port number 5432) is again familiar:

> con <- dbConnect(dbDriver('PostgreSQL'), user = 'user',
+   password = 'password', dbname = 'hflights_db')

Let's verify that we are connected to the right database instance, which should be currently empty without the hflights table:

> dbListTables(con)
character(0)
> dbExistsTable(con, 'hflights')
[1] FALSE

Then let's write our demo table in PostgreSQL and see if the old rumor about it being slower than MySQL is still true:

> dbWriteTable(con, 'hflights', hflights)
[1] TRUE
> system.time(dbReadTable(con, 'hflights'))
   user  system elapsed 
  0.590   0.013   0.921

Seems to be impressive! What about loading partial data?

> system.time(dbGetQuery(con,
+ statement = "SELECT * FROM hflights WHERE \"Dest\" = 'BNA';"))
   user  system elapsed 
  0.026   0.000   0.082

Just under 100 milliseconds without indexing! Please note the extra escaped quotes around Dest, as the default PostgreSQL behavior folds unquoted column names to lower case, which would result in a column dest does not exist error. Creating an index and running the preceding query with much improved speed can be easily reproduced based on the MySQL example.

Oracle database

Oracle Database Express Edition can be downloaded and installed from http://www.oracle.com/technetwork/database/database-technologies/express-edition/downloads/index.html. Although this is not a full-featured Oracle database, and it suffers from serious limitations, the Express Edition is a free and not too resource-hungry way to build a test environment at home.

Note

Oracle database is said to be the most popular database management system in the world, although it is available only with a proprietary license, unlike the previous two discussed RDBMSs, which means that Oracle offers the product with term licensing. On the other hand, the paid license also comes with priority support from the developer company, which is often a strict requirement in enterprise environments. Oracle Database has supported a variety of nice features since its first release in 1980, such as sharding, master-master replication, and full ACID properties.

Another way of getting a working Oracle database for testing purposes is to download an Oracle Pre-Built Developer VM from http://www.oracle.com/technetwork/community/developer-vm/index.html, or a much smaller image custom created for Hands-on Database Application Development at Oracle Technology Network Developer Day: http://www.oracle.com/technetwork/database/enterprise-edition/databaseappdev-vm-161299.html. We will follow the instructions from the latter source.

After accepting the License Agreement and registering for free at Oracle, we can download the OTN_Developer_Day_VM.ova virtual appliance. Let's import it to VirtualBox via Import appliance in the File menu, then choose the ova file, and click Next:

After clicking Import, you will have to agree again to the Software License Agreement. Importing the virtual disk image (15 GB) might take a few minutes:

After importing has finished, we should first update the networking configuration so that we can access the internal database of the virtual machine from outside. So let's switch from NAT to Bridged Adapter in the settings:

Then we can simply start the newly created virtual machine in VirtualBox. After Oracle Linux has booted, we can log in with the default oracle password.

Although we have set a bridged networking interface for our virtual machine, which means that the VM is directly connected to our real sub-network with a real IP address, the machine is not yet accessible over the network. To connect with the default DHCP settings, simply navigate to the top red bar and look for the networking icon, then select System eth0. After a few seconds the VM is accessible from your host machine, as the guest system should be connected to your network. You can verify that by running the ifconfig or ip addr show eth0 command in the already running console:

Unfortunately, this already running Oracle database is not yet accessible outside the guest machine. The developer VM comes with a rather strict firewall by default, which should be disabled first. To see the rules in effect, run the standard iptables -L -n command and, to flush all rules, execute iptables -F:

Now that we have a running and remotely accessible Oracle database, let's prepare the R client side. Installing the ROracle package might get tricky on some operating systems, as there are no prebuilt binary packages and you have to manually install the Oracle Instant Client Lite and SDK libraries before compiling the package from source. If the compiler complained about the path of your previously installed Oracle libraries, please pass the --with-oci-lib and --with-oci-inc arguments with your custom paths with the --configure-args parameter. More details can be found in the package installation document: http://cran.r-project.org/web/packages/ROracle/INSTALL.

For example, on Arch Linux you can install the Oracle libs from AUR, then run the following command in bash after downloading the R package from CRAN:

# R CMD INSTALL --configure-args='--with-oci-lib=/usr/include/    \
>  --with-oci-inc=/usr/share/licenses/oracle-instantclient-basic' \
>  ROracle_1.1-11.tar.gz

After installing and loading the package, opening a connection is extremely similar to the pervious examples with DBI::dbConnect. We only pass an extra parameter here. First, let us specify the hostname or direct IP address of the Oracle database included in the dbname argument. Then we can connect to the already existing PDB1 database of the developer machine instead of the previously used hflights_db—just to save some time and space in the book on slightly off-topic database management tasks:

> library(ROracle)
Loading required package: DBI
> con <- dbConnect(dbDriver('Oracle'), user = 'pmuser',
+   password = 'oracle', dbname = '//192.168.0.16:1521/PDB1')

And we have a working connection to Oracle RDBMS:

> summary(con)
User name:             pmuser 
Connect string:        //192.168.0.16:1521/PDB1 
Server version:        12.1.0.1.0 
Server type:           Oracle RDBMS 
Results processed:     0 
OCI prefetch:          FALSE 
Bulk read:             1000 
Statement cache size:  0 
Open results:          0

Let's see what we have in the bundled database on the development VM:

> dbListTables(con)
[1] "TICKER_G" "TICKER_O" "TICKER_A" "TICKER"

So it seems that we have a table called TICKER with three views on tick data of three symbols. Saving the hflights table in the same database will not do any harm, and we can also instantly test the speed of the Oracle database when reading the whole table:

> dbWriteTable(con, 'hflights', hflights)
[1] TRUE
> system.time(dbReadTable(con, 'hflights'))
   user  system elapsed 
  0.980   0.057   1.256

And the extremely familiar subset with 3,481 cases:

> system.time(dbGetQuery(con,
+ "SELECT * FROM \"hflights\" WHERE \"Dest\" = 'BNA'"))
   user  system elapsed
  0.046   0.003   0.131

Please note the quotes around the table name. In the previous examples with MySQL and PostgreSQL, the SQL statements run fine without those. However, the quotes are needed in the Oracle database, as we have saved the table with an all-lowercase name, and the default rule in Oracle DB is to store object names in upper case. The only other option is to use double quotes to create them, which is what we did; thus we have to refer to the table with quotes around the lowercase name.

Note

We started with unquoted table and column names in MySQL, then had to add escaped quotes around the variable name in the PostgreSQL query run from R, and now in Oracle database we have to put both names between quotes—which demonstrates the slight differences in the various SQL flavors (such as MySQL, PostgreSQL, PL/SQL of Oracle or Microsoft's Transact-SQL) on top of ANSI SQL.

And more importantly: do not stick to one database engine with all your projects, but rather choose the optimal DB for the task if company policy doesn't stop you doing so.

These results were not so impressive compared to what we have seen by PostgreSQL, so let's also see the results of an indexed query:

> dbSendQuery(con, 'CREATE INDEX Dest_idx ON "hflights" ("Dest")')
Statement:            CREATE INDEX Dest_idx ON "hflights" ("Dest") 
Rows affected:        0 
Row count:            0 
Select statement:     FALSE 
Statement completed:  TRUE 
OCI prefetch:         FALSE 
Bulk read:            1000 
> system.time(dbGetQuery(con, "SELECT * FROM \"hflights\"
+ WHERE \"Dest\" = 'BNA'"))
   user  system elapsed 
  0.023   0.000   0.069

I leave the full-scale comparative testing and benchmarking to you, so that you can run custom queries in the tests fitting your exact needs. It is highly possible that the different database engines perform differently in special use cases.

To make this process a bit more seamless and easier to implement, let's check out another R way of connecting to databases, although probably with a slight performance trade-off. For a quick scalability and performance comparison on connecting to Oracle databases with different approaches in R, please see https://blogs.oracle.com/R/entry/r_to_oracle_database_connectivity.

ODBC database access

As mentioned earlier, installing the native client software, libraries, and header files for the different databases so that the custom R packages can be built from source can be tedious and rather tricky in some cases. Fortunately, we can also try to do the opposite of this process. An alternative solution can be installing a middleware Application Programming Interface (API) in the databases, so that R, or as a matter of fact any other tool, could communicate with them in a standardized and more convenient way. However, please be advised that this more convenient way impairs performance due to the translation layer between the application and the DBMS.

The RODBC package implements access to such a layer. The Open Database Connectivity (ODBC) driver is available for most database management systems, even for CSV and Excel files, so RODBC provides a standardized way to access data in almost any databases if the ODBC driver is installed. This platform-independent interface is available for SQLite, MySQL, MariaDB, PostgreSQL, Oracle database, Microsoft SQL Server, Microsoft Access, and IBM DB2 on Windows and on Linux.

For a quick example, let's connect to MySQL running on localhost (or on a virtual machine). First, we have to set up a Database Source Name (DSN) with the connection details, such as:

Database driver
Host name or address and port number, optionally a Unix socket
Database name
Optionally the username and password to be used for the connection

This can be done in the command line by editing the odbc.ini and odbcinst.ini files on Linux after installing the unixODBC program. The latter should include the following configuration for the MySQL driver in your /etc folder:

[MySQL]
Description     = ODBC Driver for MySQL
Driver          = /usr/lib/libmyodbc.so
Setup           = /usr/lib/libodbcmyS.so
FileUsage       = 1

The odbc.ini file includes the aforementioned DSN configuration for the exact database and server:

[hflights]
Description     = MySQL hflights test
Driver          = MySQL
Server          = localhost
Database        = hflights_db
Port            = 3306
Socket          = /var/run/mysqld/mysqld.sock

Or use a graphical user interface on Mac OS or Windows, as shown in the following screenshot:

After configuring a DSN, we can connect with a one-line command:

> library(RODBC)
> con <- odbcConnect("hflights", uid = "user", pwd = "password")

Let's fetch the data we saved in the database before:

> system.time(hflights <- sqlQuery(con, "select * from hflights"))
   user  system elapsed 
  3.180   0.000   3.398

Well, it took a few seconds to finish. That's the trade-off for using a more convenient and high-level interface to interact with the database. Removing and uploading data to the database can be done with similar high-level functions (such as sqlFetch) besides the odbc* functions, providing low-level access to the database. Quick examples:

> sqlDrop(con, 'hflights')
> sqlSave(con, hflights, 'hflights')

You can use the exact same commands to query any of the other supported database engines; just be sure to set up the DSN for each backend, and to close your connections if not needed any more:

> close(con)

The RJDBC package can provide a similar interface to database management systems with a Java Database Connectivity (JDBC) driver.

Using a graphical user interface to connect to databases

Speaking of high-level interfaces, R also has a graphical user interface to connect to MySQL in the dbConnect package:

> library(dbConnect)
Loading required package: RMySQL
Loading required package: DBI
Loading required package: gWidgets
> DatabaseConnect()
Loading required package: gWidgetsRGtk2
Loading required package: RGtk2

No parameters, no custom configuration in the console, just a simple dialog window:

Using a graphical user interface to connect to databases

After providing the required connection information, we can easily view the raw data and the column/variable types, and run custom SQL queries. A basic query builder can also help novice users to fetch subsamples from the database:

The package ships with a handy function called sqlToR, which can turn the SQL results into R objects with a click in the GUI. Unfortunately, dbConnect relies heavily on RMySQL, which means it's a MySQL-only package, and there is no plan to extend the functionality of this interface.

Other database backends

Besides the previously mentioned popular databases, there are several other implementations that we cannot discuss here in detail.

For example, column-oriented database management systems, such as MonetDB, are often used to store large datasets with millions of rows and thousands of columns to provide the backend for high-performance data mining. It also has great R support with the MonetDB.R package, which was among the most exciting talks at the useR! 2013 conference.

The ever-growing popularity of the NoSQL ecosystem also provides similar approaches, although usually without supporting SQL and providing a schema-free data storage. Apache Cassandra is a good example of such a similar, column-oriented, and primarily distributed database management system with high availably and performance, run on commodity hardware. The RCassandra package provides access to the basic Cassandra features and the Cassandra Query Language in a convenient way with the RC.* function family. Another Google Bigtable-inspired and similar database engine is HBase, which is supported by the rhbase package, part of the RHadoop project: https://github.com/RevolutionAnalytics/RHadoop/wiki.

Speaking of Massively Parallel Processing, HP's Vertica and Cloudera's open-source Impala are also accessible from R, so you can easily access and query large amount of data with relatively good performance.

One of the most popular NoSQL databases is MongoDB, which provides document-oriented data storage in a JSON-like format, providing an infrastructure to dynamic schemas. MongoDB is actively developed and has some SQL-like features, such as a query language and indexing, also with multiple R packages providing access to this backend. The RMongo package uses the mongo-java-driver and thus depends on Java, but provides a rather high-level interface to the database. Another implementation, the rmongodb package, is developed and maintained by the MongoDB Team. The latter has more frequent updates and more detailed documentation, but the R integration seems to be a lot more seamless with the first package as rmongodb provides access to the raw MongoDB functions and BSON objects, instead of concentrating on a translation layer for general R users. A more recent and really promising package supporting MongoDB is mongolite developed by Jeroen Ooms.

CouchDB, my personal favorite for most schema-less projects, provides very convenient document storage with JSON objects and HTTP API, which means that integrating in applications, such as any R script, is really easy with, for example, the RCurl package, although you may find the R4CouchDB more quick to act in interacting with the database.

Google BigQuery also provides a similar, REST-based HTTP API to query even terabytes of data hosted in the Google infrastructure with an SQL-like language. Although the bigrquery package is not available on CRAN yet, you may easily install it from GitHub with the devtools package from the same author, Hadley Wickham:

> library(devtools)
> install_github('bigrquery', 'hadley')

To test-drive the features of this package and Google BigQuery, you can sign up for a free account to fetch and process the demo dataset provided by Google, respecting the 10,000 requests per day limitation for free usage. Please note that the current implementation is a read-only interface to the database.

For rather similar database engines and comparisons, see for example http://db-engines.com/en/systems. Most of the popular databases already have R support but, if not, I am pretty sure that someone is already working on it. It's worth checking the CRAN packages at http://cran.r-project.org/web/packages/available_packages_by_name.html or searching on GitHub or on http://R-bloggers.com to see how other R users manage to interact with your database of choice.

Loading Excel spreadsheets

One of the most popular file formats to store and transfer relatively small amounts of data in academic institutions and businesses (besides CSV files) is still Excel xls (or xlsx, more recently). The first is a proprietary binary file format from Microsoft, which is exhaustively documented (the xls specification is available in a document of more than 1,100 pages and 50 megabytes!), but importing multiple sheets, macros, and formulas is not straightforward even nowadays. This section will only cover the most used platform-independent packages to interact with Excel.

One option is to use the previously discussed RODBC package with the Excel driver to query an Excel spreadsheet. Other ways of accessing Excel data depend on third-party tools, such as using Perl to automatically convert the Excel file to CSV then importing it into R as the read.xls function from the gdata package. But installing Perl on Windows sometimes seems to be tedious; thus, RODBC might be a more convenient method on that platform.

Some platform-independent, Java-based solutions also provide a way to not just read, but also write Excel files, especially to the xlsx, the Office Open XML file, format. Two separate implementations exist on CRAN to read and write Excel 2007 and the 97/2000/XP/2003 file formats: the xlConnect and the xlsx packages. Both are actively maintained, and use the Apache POI Java API project. This latter means that it runs on any platform that supports Java, and there is no need to have Microsoft Excel or Office on the computer; both packages can read and write Excel files on their own.

On the other hand, if you would rather not depend on Perl or Java, the recently published openxlsx package provides a platform-independent (C++-powered) way of reading and writing xlsx files. Hadley Wickham released a similar package, but with a slightly modified scope: the readxl package can read (but not write) both the xls and xlsx file formats.

Remember: pick the most appropriate tool for your needs! For example to read Excel files without many external dependencies, I'd choose readxl; but, for writing Excel 2003 spreadsheets with cell formatting and more advanced features, probably we cannot save the Java dependency and should use the xlConnect or xlsx packages over the xlsx-only openxlsx package.

Fabien Deneuville Mar 27, 2016

Ce livre est parfait pour les gens qui connaissent déjà bien R et veulent aller plus loin.J'ai découvert des choses que je ne soupçonnais pas, j'ai particulièrement apprécié le chapitre sur les graphes, celui sur les séries temporelles, le clustering... en fait il s'agit de zooms très utiles pour qui veut aller plus loin dans l'utilisation de R pour le data mining.

Amazon Verified review

Duncan W. Robinson Oct 29, 2015

I found "Mastering Data Analysis with R" useful & very readable. I especially liked the sections dealing with obtaining, filtering & manipulating data. Though some of this information can be found in various R package vignettes, much of it, including useful tips, is all in one place here. For example, I found the section on using functions from the R package "dplyr" useful for merging data sets. The reader of this book might do well to follow-up this volume with Hadley Wickham’s "Advanced R". I also enjoyed the treatment given to importing the corpus for text analysis. Python generally seems to be a bit easier to navigate for the pre-processing step when performing text analytics. This book made conducting the process in R much easier for me.