Most projects in R start with loading at least some data into the running R session. As R supports a variety of file formats and database backend, there are several ways to do so. In this chapter, we will not deal with basic data structures, which are already familiar to you, but will concentrate on the performance issue of loading larger datasets and dealing with special file formats.
For a quick overview on the standard tools and to refresh your knowledge on importing general data, please see Chapter 7 of the official An Introduction to R manual of CRAN at http://cran.r-project.org/doc/manuals/R-intro.html#Reading-data-from-files or Rob Kabacoff's Quick-R site, which offers keywords and cheat-sheets for most general tasks in R at http://www.statmethods.net/input/importingdata.html. For further materials, please see the References section in the Appendix.
Although R has its own (serialized) binary
rds file formats, which are extremely convenient to use for all R users as these also store R object meta-information in an efficient way, most of the time we have to deal with other input formats—provided by our employer or client.
One of the most popular data file formats is flat files, which are simple text files in which the values are separated by white-space, the pipe character, commas, or more often by semi-colon in Europe. This chapter will discuss several options R has to offer to load these kinds of documents, and we will benchmark which of these is the most efficient approach to import larger files.
Sometimes we are only interested in a subset of a dataset; thus, there is no need to load all the data from the sources. In such cases, database backend can provide the best performance, where the data is stored in a structured way preloaded on our system, so we can query any subset of that with simple and efficient commands. The second section of this chapter will focus on the three most popular databases (MySQL, PostgreSQL, and Oracle Database), and how to interact with those in R.
Besides some other helper tools and a quick overview on other database backend, we will also discuss how to load Excel spreadsheets into R—without the need to previously convert those to text files in Excel or Open/LibreOffice.
Of course this chapter is not just about data file formats, database connections, and such boring internals. But please bear in mind that data analytics always starts with loading data. This is unavoidable, so that our computer and statistical environment know the structure of the data before doing some real analytics.
The title of this chapter might also be Hello, Big Data!, as now we concentrate on loading relatively large amount of data in an R session. But what is Big Data, and what amount of data is problematic to handle in R? What is reasonable size?
R was designed to process data that fits in the physical memory of a single computer. So handling datasets that are smaller than the actual accessible RAM should be fine. But please note that the memory required to process data might become larger while doing some computations, such as principal component analysis, which should be also taken into account. I will refer to this amount of data as reasonable sized datasets.
Loading data from text files is pretty simple with R, and loading any reasonable sized dataset can be achieved by calling the good old
read.table function. The only issue here might be the performance: how long does it take to read, for example, a quarter of a million rows of data? Let's see:
> library('hflights') > write.csv(hflights, 'hflights.csv', row.names = FALSE)
As a reminder, please note that all R commands and the returned output are formatted as earlier in this book. The commands starts with
> on the first line, and the remainder of multi-line expressions starts with
+, just as in the R console. To copy and paste these commands on your machine, please download the code examples from the Packt homepage. For more details, please see the What you need for this book section in the Preface.
> str(hflights) 'data.frame': 227496 obs. of 21 variables: $ Year : int 2011 2011 2011 2011 2011 2011 2011 ... $ Month : int 1 1 1 1 1 1 1 1 1 1 ... $ DayofMonth : int 1 2 3 4 5 6 7 8 9 10 ... $ DayOfWeek : int 6 7 1 2 3 4 5 6 7 1 ... $ DepTime : int 1400 1401 1352 1403 1405 1359 1359 ... $ ArrTime : int 1500 1501 1502 1513 1507 1503 1509 ... $ UniqueCarrier : chr "AA" "AA" "AA" "AA" ... $ FlightNum : int 428 428 428 428 428 428 428 428 428 ... $ TailNum : chr "N576AA" "N557AA" "N541AA" "N403AA" ... $ ActualElapsedTime: int 60 60 70 70 62 64 70 59 71 70 ... $ AirTime : int 40 45 48 39 44 45 43 40 41 45 ... $ ArrDelay : int -10 -9 -8 3 -3 -7 -1 -16 44 43 ... $ DepDelay : int 0 1 -8 3 5 -1 -1 -5 43 43 ... $ Origin : chr "IAH" "IAH" "IAH" "IAH" ... $ Dest : chr "DFW" "DFW" "DFW" "DFW" ... $ Distance : int 224 224 224 224 224 224 224 224 224 ... $ TaxiIn : int 7 6 5 9 9 6 12 7 8 6 ... $ TaxiOut : int 13 9 17 22 9 13 15 12 22 19 ... $ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ... $ CancellationCode : chr "" "" "" "" ... $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
hflights package provides an easy way to load a subset of the huge Airline Dataset of the Research and Innovation Technology Administration at the Bureau of Transportation Statistics. The original database includes the scheduled and actual departure/arrival times of all US flights along with some other interesting information since 1987, and is often used to demonstrate machine learning and Big Data technologies. For more details on the dataset, please see the column description and other meta-data at http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0.
We will use this 21-column data to benchmark data import times. For example, let's see how long it takes to import the CSV file with
> system.time(read.csv('hflights.csv')) user system elapsed 1.730 0.007 1.738
It took a bit more than one and a half seconds to load the data from an SSD here. It's quite okay, but we can achieve far better results by identifying then specifying the classes of the columns instead of calling the default
type.convert (see the docs in
read.table for more details or search on StackOverflow, where the performance of
read.csv seems to be a rather frequent and popular question):
> colClasses <- sapply(hflights, class) > system.time(read.csv('hflights.csv', colClasses = colClasses)) user system elapsed 1.093 0.000 1.092
It's much better! But should we trust this one observation? On our way to mastering data analysis in R, we should implement some more reliable tests—by simply replicating the task n times and providing a summary on the results of the simulation. This approach provides us with performance data with multiple observations, which can be used to identify statistically significant differences in the results. The
microbenchmark package provides a nice framework for such tasks:
> library(microbenchmark) > f <- function() read.csv('hflights.csv') > g <- function() read.csv('hflights.csv', colClasses = colClasses, + nrows = 227496, comment.char = '') > res <- microbenchmark(f(), g()) > res Unit: milliseconds expr min lq median uq max neval f() 1552.3383 1617.8611 1646.524 1708.393 2185.565 100 g() 928.2675 957.3842 989.467 1044.571 1284.351 100
So we defined two functions:
f stands for the default settings of
read.csv while, in the
g function, we passed the aforementioned column classes along with two other parameters for increased performance. The
comment.char argument tells R not to look for comments in the imported data file, while the
nrows parameter defined the exact number of rows to read from the file, which saves some time and space on memory allocation. Setting
FALSE might also speed up importing a bit.
Identifying the number of lines in the text file could be done with some third-party tools, such as
wc on Unix, or a slightly slower alternative would be the
countLines function from the
But back to the results. Let's also visualize the median and related descriptive statistics of the test cases, which was run 100 times by default:
> boxplot(res, xlab = '', + main = expression(paste('Benchmarking ', italic('read.table'))))
The difference seems to be significant (please feel free to do some statistical tests to verify that), so we made a 50+ percent performance boost simply by fine-tuning the parameters of
Loading a larger amount of data into R from CSV files that would not fit in the memory could be done with custom packages created for such cases. For example, both the
sqldf package and the
ff package have their own solutions to load data from chunk to chunk in a custom data format. The first uses SQLite or another SQL-like database backend, while the latter creates a custom data frame with the
ffdf class that can be stored on disk. The
bigmemory package provides a similar approach. Usage examples (to be benchmarked) later:
> library(sqldf) > system.time(read.csv.sql('hflights.csv')) user system elapsed 2.293 0.090 2.384 > library(ff) > system.time(read.csv.ffdf(file = 'hflights.csv')) user system elapsed 1.854 0.073 1.918 > library(bigmemory) > system.time(read.big.matrix('hflights.csv', header = TRUE)) user system elapsed 1.547 0.010 1.559
Please note that the header defaults to
read.big.matrix from the
bigmemory package, so be sure to read the manual of the referenced functions before doing your own benchmarks. Some of these functions also support performance tuning like
read.table. For further examples and use cases, please see the Large memory and out-of-memory data section of the High-Performance and Parallel Computing with R CRAN Task View at http://cran.r-project.org/web/views/HighPerformanceComputing.html.
Another notable alternative for handling and loading reasonable sized data from flat files to R is the
data.table package. Although it has a unique syntax differing from the traditional S-based R markup, the package comes with great documentation, vignettes, and case studies on the indeed impressive speedup it can offer for various database actions. Such uses cases and examples will be discussed in the Chapter 3, Filtering and Summarizing Data and Chapter 4, Restructuring Data.
The package ships a custom R function to read text files with improved performance:
> library(data.table) > system.time(dt <- fread('hflights.csv')) user system elapsed 0.153 0.003 0.158
Loading the data was extremely quick compared to the preceding examples, although it resulted in an R object with a custom
data.table class, which can be easily transformed to the traditional
data.frame if needed:
> df <- as.data.frame(dt)
Or by using the
setDF function, which provides a very fast and in-place method of object conversion without actually copying the data in the memory. Similarly, please note:
> is.data.frame(dt)  TRUE
This means that a
data.table object can fall back to act as a
data.frame for traditional usage. Leaving the imported data as is or transforming it to
data.frame depends on the latter usage. Aggregating, merging, and restructuring data with the first is faster compared to the standard data frame format in R. On the other hand, the user has to learn the custom syntax of
DT[i, j, by] stands for "from DT subset by
i, then do
j grouped by by". We will discuss it later in the Chapter 3, Filtering and Summarizing Data.
Now, let's compare all the aforementioned data import methods: how fast are they? The final winner seems to be
data.table anyway. First, we define some methods to be benchmarked by declaring the test functions:
> .read.csv.orig <- function() read.csv('hflights.csv') > .read.csv.opt <- function() read.csv('hflights.csv', + colClasses = colClasses, nrows = 227496, comment.char = '', + stringsAsFactors = FALSE) > .read.csv.sql <- function() read.csv.sql('hflights.csv') > .read.csv.ffdf <- function() read.csv.ffdf(file = 'hflights.csv') > .read.big.matrix <- function() read.big.matrix('hflights.csv', + header = TRUE) > .fread <- function() fread('hflights.csv')
Now, let's run all these functions 10 times each instead of several hundreds of iterations like previously—simply to save some time:
> res <- microbenchmark(.read.csv.orig(), .read.csv.opt(), + .read.csv.sql(), .read.csv.ffdf(), .read.big.matrix(), .fread(), + times = 10)
And print the results of the benchmark with a predefined number of digits:
> print(res, digits = 6) Unit: milliseconds expr min lq median uq max neval .read.csv.orig() 2109.643 2149.32 2186.433 2241.054 2421.392 10 .read.csv.opt() 1525.997 1565.23 1618.294 1660.432 1703.049 10 .read.csv.sql() 2234.375 2265.25 2283.736 2365.420 2599.062 10 .read.csv.ffdf() 1878.964 1901.63 1947.959 2015.794 2078.970 10 .read.big.matrix() 1579.845 1603.33 1647.621 1690.067 1937.661 10 .fread() 153.289 154.84 164.994 197.034 207.279 10
Please note that now we were dealing with datasets fitting in actual physical memory, and some of the benchmarked packages are designed and optimized for far larger databases. So it seems that optimizing the
read.table function gives a great performance boost over the default settings, although if we are after really fast importing of reasonable sized data, using the
data.table package is the optimal solution.
Sometimes we only need some parts of the dataset for an analysis, stored in a database backend or in flat files. In such situations, loading only the relevant subset of the data frame will result in much more speed improvement compared to any performance tweaks and custom packages discussed earlier.
Let's imagine we are only interested in flights to Nashville, where the annual useR! conference took place in 2012. This means we need only those rows of the CSV file where the
BNA (this International Air Transport Association airport code stands for Nashville International Airport).
Instead of loading the whole dataset in 160 to 2,000 milliseconds (see the previous section) and then dropping the unrelated rows (see in Chapter 3, Filtering and Summarizing Data), let's see the possible ways of filtering the data while loading it.
> df <- read.csv.sql('hflights.csv', + sql = "select * from file where Dest = '\"BNA\"'")
sql argument defaults to
"select * from file", which means loading all fields of each row without any filters. Now we extended that with a
filter statement. Please note that in our updated SQL statements, we also added the double quotes to the search term, as
sqldf does not automatically recognize the quotes as special; it regards them as part of the fields. One may overcome this issue also by providing a custom filter argument, such as the following example on Unix-like systems:
> df <- read.csv.sql('hflights.csv', + sql = "select * from file where Dest = 'BNA'", + filter = 'tr -d ^\\" ')
The resulting data frame holds only 3,481 observations out of the 227,496 cases in the original dataset, and filtering inside the temporary SQLite database of course speeds up data importing a bit:
> system.time(read.csv.sql('hflights.csv')) user system elapsed 2.117 0.070 2.191 > system.time(read.csv.sql('hflights.csv', + sql = "select * from file where Dest = '\"BNA\"'")) user system elapsed 1.700 0.043 1.745
The slight improvement is due to the fact that both R commands first loaded the CSV file to a temporary SQLite database; this process of course takes some time and cannot be eliminated from this process. To speed up this part of the evaluation, you can specify
NULL for a performance boost. This way, the SQLite database would be created in memory instead of a
tempfile, which might not be an optimal solution for larger datasets.
Is there a faster or smarter way to load only a portion of such a text file? One might apply some regular expression-based filtering on the flat files before passing them to R. For example,
ack might be a great tool to do so in a Unix environment, but it's not available by default on Windows machines, and parsing CSV files by regular expressions might result in some unexpected side-effects as well. Believe me, you never want to write a CSV, JSON, or XML parser from scratch!
Anyway, a data scientist nowadays should be a real jack-of-all-trades when it comes to processing data, so here comes a quick and dirty example to show how one could read the filtered data in less than 100 milliseconds:
> system.time(system('cat hflights.csv | grep BNA', intern = TRUE)) user system elapsed 0.040 0.050 0.082
Well, that's a really great running time compared to any of our previous results! But what if we want to filter for flights with an arrival delay of more than 13.5 minutes?
Another way, and probably a more maintainable approach, would be to first load the data into a database backend, and query that when any subset of the data is needed. This way we could for example, simply populate a SQLite database in a file only once, and then later we could fetch any subsets in a fragment of
read.csv.sql's default run time.
So let's create a persistent SQLite database:
> sqldf("attach 'hflights_db' as new")
This command has just created a file named to
hflights_db in the current working directory. Next, let's create a table named
hflights and populate the content of the CSV file to the database created earlier:
> read.csv.sql('hflights.csv', + sql = 'create table hflights as select * from file', + dbname = 'hflights_db')
No benchmarking was made so far, as these steps will be run only once, while the queries for sub-parts of the dataset will probably run multiple times later:
> system.time(df <- sqldf( + sql = "select * from hflights where Dest = '\"BNA\"'", + dbname = "hflights_db")) user system elapsed 0.070 0.027 0.097
And we have just loaded the required subset of the database in less than 100 milliseconds! But we can do a lot better if we plan to often query the persistent database: why not dedicate a real database instance for our dataset instead of a simple file-based and server-less SQLite backend?
Powerful and quick ways to aggregate and filter data before loading it to R
Infrastructure to store data in a relational, more structured scheme compared to the traditional matrix model of spreadsheets and R objects
Procedures to join and merge related data
Concurrent and network access from multiple clients at the same time
Security policies and limits to access the data
A scalable and configurable backend to store data
DBI package provides a database interface, a communication channel between R and various relational database management systems (RDBMS), such as MySQL, PostgreSQL, MonetDB, Oracle, and for example Open Document Databases, and so on. There is no real need to install the package on its own because, acting as an interface, it will be installed anyway as a dependency, if needed.
Connecting to a database and fetching data is pretty similar with all these backends, as all are based on the relational model and using SQL to manage and query data. Please be advised that there are some important differences between the aforementioned database engines and that several more open-source and commercial alternatives also exist. But we will not dig into the details on how to choose a database backend or how to build a data warehouse and extract, transform, and load (ETL) workflows, but we will only concentrate on making connections and managing data from R.
SQL, originally developed at IBM, with its more than 40 years of history, is one of the most important programming languages nowadays—with various dialects and implementations. Being one of the most popular declarative languages all over the world, there are many online tutorials and free courses to learn how to query and manage data with SQL, which is definitely one of the most important tools in every data scientist's Swiss army knife.
So, besides R, it's really worth knowing your way around RDBMS, which are extremely common in any industry you may be working at as a data analyst or in a similar position.
Database backends usually run on servers remote from the users doing data analysis, but for testing purposes, it might be a good idea to install local instances on the machine running R. As the installation process can be extremely different on various operating systems, we will not enter into any details of the installation steps, but we will rather refer to where the software can be downloaded from and some further links to great resources and documentation for installation.
Please note that installing and actually trying to load data from these databases is totally optional and you do not have to follow each step—the rest of the book will not depend on any database knowledge or prior experience with databases. On the other hand, if you do not want to mess your workspace with temporary installation of multiple database applications for testing purposes, using virtual machines might be an optimal workaround. Oracle's
VirtualBox provides a free and easy way of running multiple virtual machines with their dedicated operating system and userspace.
For detailed instructions on how to download then import a
VirtualBox image, see the Oracle section.
This way you can quickly deploy a fully functional, but disposable, database environment to test-drive the following examples of this chapter. In the following image, you can see
VirtualBox with four installed virtual machines, of which three are running in the background to provide some database backends for testing purposes:
VirtualBox can be installed by your operating system's package manager on Linux or by downloading the installation binary/sources from https://www.virtualbox.org/wiki/Downloads. For detailed and operating-system specific installation information, please refer to the Chapter 2, Installation details of the manual: http://www.virtualbox.org/manual/.
Nowadays, setting up and running a virtual machine is really intuitive and easy; basically you only need a virtual machine image to be loaded and launched. Some virtual machines, so called appliances, include the operating system, with a number of further software usually already configured to work, for simple, easy and quick distribution.
Once again, if you do not enjoy installing and testing new software or spending time on learning about the infrastructure empowering your data needs, the following steps are not necessary and you can freely skip these optional tasks primarily described for full-stack developers/data scientists.
Such pre-configured virtual machines to be run on any computer can be downloaded from various providers on the Internet in multiple file formats, such as OVF or OVA. General purpose
VirtualBox virtual appliances can be downloaded for example from http://virtualboximages.com/vdi/index or http://virtualboxes.org/images/.
Oracle also has a repository with a bunch of useful virtual images for data scientist apprentices and other developers at http://www.oracle.com/technetwork/community/developer-vm/index.html, with for example the Oracle Big Data Lite VM developer virtual appliance featuring the following most important components:
Disclaimer: Oracle wouldn't be my first choice personally, but they did a great job with their platform-independent virtualization environment, just like with providing free developer VMs based on their commercial products. In short, it's definitely worth using the provided Oracle tools.
If you cannot reach your installed virtual machines on the network, please update your network settings to use Host-only adapter if no Internet connection is needed, or Bridged networking for a more robust setup. The latter setting will reserve an extra IP on your local network for the virtual machine; this way, it becomes accessible easily. Please find more details and examples with screenshots in the Oracle database section.
Another good source of virtual appliances created for open-source database engines is the Turnkey GNU/Linux repository at http://www.turnkeylinux.org/database. These images are based on Debian Linux, are totally free to use, and currently support the MySQL, PostgreSQL, MongoDB, and CouchDB databases.
A great advantage of the Turnkey Linux media is that it includes only open-source, free software and non-proprietary stuff. Besides, the disk images are a lot smaller and include only the required components for one dedicated database engine. This also results in far faster installation with less overhead in terms of the required disk and memory space.
Further similar virtual appliances are available at http://www.webuzo.com/sysapps/databases with a wider range of database backends, such as Cassandra, HBase, Neo4j, Hypertable, or Redis, although some of the Webuzo appliances might require a paid subscription for deployment.
And as the new cool being Docker, I even more suggest you to get familiar with its concept on deploying software containers incredibly fast. Such container can be described as a standalone filesystem including the operating system, libraries, tools, data and so is based on abstraction layers of Docker images. In practice this means that you can fire up a database including some demo data with a one-liner command on your localhost, and developing such custom images is similarly easy. Please see some simple examples and further references at my R and Pandoc-related Docker images described at https://github.com/cardcorp/card-rocker.
MySQL is the most popular open-source database engine all over the world based on the number of mentions, job offers, Google searches, and so on, summarized by the DB-Engines Ranking: http://db-engines.com/en/ranking. Mostly used in Web development, the high popularity is probably due to the fact that MySQL is free, platform-independent, and relatively easy to set up and configure—just like its drop-in replacement fork called MariaDB.
MariaDB is a community-developed, fully open-source fork of MySQL, started and led by the founder of MySQL, Michael Widenius. It was later merged with SkySQL; thus further ex-MySQL executives and investors joined the fork. MariaDB was created after Sun Microsystems bought MySQL, currently owned by Oracle, and the development of the database engine changed.
We will refer to both engines as MySQL in the book to keep it simple, as MariaDB can be considered as a drop-in replacement for MySQL, so please feel free to reproduce the following examples with either MySQL or MariaDB.
Although the installation of a MySQL server is pretty straightforward on most operating systems (https://dev.mysql.com/downloads/mysql/), one might rather prefer to have the database installed in a virtual machine. Turnkey Linux provides small but fully configured, virtual appliances for free: http://www.turnkeylinux.org/mysql.
R provides multiple ways to query data from a MySQL database. One option is to use the
RMySQL package, which might be a bit tricky for some users to install. If you are on Linux, please be sure to install the development packages of MySQL along with the MySQL client, so that the package can compile on your system. And, as there are no binary packages available on CRAN for Windows installation due to the high variability of MySQL versions, Windows users should also compile the package from source:
> install.packages('RMySQL', type = 'source')
Windows users might find the following blog post useful about the detailed installation steps: http://www.ahschulz.de/2013/07/23/installing-rmysql-under-windows/.
For the sake of simplicity, we will refer to the MySQL server as
localhost listening on the default 3306 port; user will stand as
user and password as
password in all database connections. We will work with the
hflights table in the
hflights_db database, just like in the SQLite examples a few pages earlier. If you are working in a remote or virtual server, please modify the
username, and so on arguments of the following code examples accordingly.
After successfully installing and starting the MySQL server, we have to set up a test database, which we could later populate in R. To this end, let us start the MySQL command-line tool to create the database and a test user.
This quick session can be seen in the previous screenshot, where we first connected to the MySQL server in the command-line as the
root (admin) user. Then we created a database named
hflights_db, and granted all privileges and permissions of that database to a new user called
user with the password set to
password. Then we simply verified whether we could connect to the database with the newly created user, and we exited the command-line MySQL client.
> library(RMySQL) Loading required package: DBI > con <- dbConnect(dbDriver('MySQL'), + user = 'user', password = 'password', dbname = 'hflights_db')
Now we can refer to our MySQL connection as
con, where we want to deploy the
hflights dataset for later access:
> dbWriteTable(con, name = 'hflights', value = hflights)  TRUE > dbListTables(con)  "hflights"
dbWriteTable function wrote the
hflights data frame with the same name to the previously defined connection. The latter command shows all the tables in the currently used databases, equivalent to the
SHOW TABLES SQL command. Now that we have our original CVS file imported to MySQL, let's see how long it takes to read the whole dataset:
> system.time(dbReadTable(con, 'hflights')) user system elapsed 0.993 0.000 1.058
> system.time(dbGetQuery(con, 'select * from hflights')) user system elapsed 0.910 0.000 1.158
And, just to keep further examples simpler, let's get back to the
sqldf package, which stands for "SQL select on data frames". As a matter of fact,
sqldf is a convenient wrapper around DBI's
dbSendQuery function with some useful defaults, and returns
data.frame. This wrapper can query various database engines, such as SQLite, MySQL, H2, or PostgreSQL, and defaults to the one specified in the global
sqldf.driver option; or, if that's
NULL, it will then check if any R packages have been loaded for the aforementioned backends.
As we have already loaded
sqldf will default to using MySQL instead of SQLite. But we still have to specify which connection to use; otherwise the function will try to open a new one—without any idea about our complex username and password combination, not to mention the mysterious database name. The connection can be passed in each
sqldf expression or defined once in a global option:
> options('sqldf.connection' = con) > system.time(sqldf('select * from hflights')) user system elapsed 0.807 0.000 1.014
The difference in the preceding three versions of the same task does not seem to be significant. That 1-second timing seems to be a pretty okay result compared to our previously tested methods—although loading the whole dataset with
data.table still beats this result. What about if we only need a subset of the dataset? Let's fetch only those flights ending in Nashville, just like in our previous SQLite example:
> system.time(sqldf('SELECT * FROM hflights WHERE Dest = "BNA"')) user system elapsed 0.000 0.000 0.281
This does not seem to be very convincing compared to our previous SQLite test, as the latter could reproduce the same result in less than 100 milliseconds. But please also note that that both the user and system elapsed times are zero, which was not the case with SQLite.
The returned elapsed time by
system.time means the number of milliseconds passed since the start of the evaluation. The user and system times are a bit trickier to understand; they are reported by the operating system. More or less,
user means the CPU time spent by the called process (like R or the MySQL server), while
system reports the CPU time required by the kernel and other operating system processes (such as opening a file for reading). See
?proc.time for further details.
This means that no CPU time was used at all to return the required subset of data, which took almost 100 milliseconds with SQLite. How is it possible? What if we index the database on
> dbSendQuery(con, 'CREATE INDEX Dest_idx ON hflights (Dest(3));')
This SQL query stands for creating an index named
Dest_idx in our table based on the
Dest column's first three letters.
SQL index can seriously boost the performance of a
SELECT statement with
WHERE clauses, as MySQL this way does not have to read through the entire database to match each row, but it can determine the position of the relevant search results. This performance boost becomes more and more spectacular with larger databases, although it's also worth mentioning that indexing only makes sense if subsets of data are queried most of the time. If most or all data is needed, sequential reads would be faster.
> system.time(sqldf('SELECT * FROM hflights WHERE Dest = "BNA"')) user system elapsed 0.024 0.000 0.034
It seems to be a lot better! Well, of course, we could have also indexed the SQLite database, not just the MySQL instance. To test it again, we have to revert the default
sqldf driver to SQLite, which was overridden by loading the
> options(sqldf.driver = 'SQLite') > sqldf("CREATE INDEX Dest_idx ON hflights(Dest);", + dbname = "hflights_db")) NULL > system.time(sqldf("select * from hflights where + Dest = '\"BNA\"'", dbname = "hflights_db")) user system elapsed 0.034 0.004 0.036
So it seems that both database engines are capable of returning the required subset of data in a fraction of a second, which is a lot better even compared to what we achieved with the impressive
Although SQLite proved to be faster than MySQL in some earlier examples, there are many reasons to choose the latter in most situations. First, SQLite is a file-based database, which simply means that the database should be on a filesystem attached to the computer running R. This usually means having the SQLite database and the running R session on the same computer. Similarly, MySQL can handle larger amount of data; it has user management and rule-based control on what they can do, and concurrent access to the same dataset. The smart data scientist knows how to choose his weapon—depending on the task, another database backend might be the optimal solution. Let's see what other options we have in R!
While MySQL is said to be the most popular open-source relational database management system, PostgreSQL is famous for being "the world's most advanced open source database". This means that PostgreSQL is often considered to have more features compared to the simpler but faster MySQL, including analytic functions, which has led to PostgreSQL often being described as the open-source version of Oracle.
This sounds rather funny now, as Oracle owns MySQL today. So a bunch of things have changed in the past 20-30 years of RDBMS history, and PostgreSQL is not so slow any more. On the other hand, MySQL has also gained some nice new features—for example MySQL also became ACID-compliant with the
InnoDB engine, allowing rollback to previous states of the database. There are some other differences between the two popular database servers that might support choosing either of them. Now let's see what happens if our data provider has a liking for PostgreSQL instead of MySQL!
Installing PostgreSQL is similar to MySQL. One may install the software with the operating system's package manager, download a graphical installer from http://www.enterprisedb.com/products-services-training/pgdownload, or run a virtual appliance with, for example, the free Turnkey Linux, which provides a small but fully configured disk image for free at http://www.turnkeylinux.org/postgresql.
Downloading the example code
You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
After successfully installing and starting the server, let's set up the test database—just like we did after the MySQL installation:
The syntax is a bit different in some cases, and we have used some command-line tools for the user and database creation. These helper programs are shipped with PostgreSQL by default, and MySQL also have some similar functionality with
After setting up the initial test environment, or if we already have a working database instance to connect, we can repeat the previously described data management tasks with the help of the
> library(RPostgreSQL) Loading required package: DBI
If your R session starts to throw strange error messages in the following examples, it's highly possible that the loaded R packages are conflicting. You could simply start a clean R session, or detach the previously attached packages—for example,
detach('package:RMySQL', unload = TRUE).
Connecting to the database (listening on the default port number 5432) is again familiar:
> con <- dbConnect(dbDriver('PostgreSQL'), user = 'user', + password = 'password', dbname = 'hflights_db')
Let's verify that we are connected to the right database instance, which should be currently empty without the
> dbListTables(con) character(0) > dbExistsTable(con, 'hflights')  FALSE
Then let's write our demo table in PostgreSQL and see if the old rumor about it being slower than MySQL is still true:
> dbWriteTable(con, 'hflights', hflights)  TRUE > system.time(dbReadTable(con, 'hflights')) user system elapsed 0.590 0.013 0.921
> system.time(dbGetQuery(con, + statement = "SELECT * FROM hflights WHERE \"Dest\" = 'BNA';")) user system elapsed 0.026 0.000 0.082
Just under 100 milliseconds without indexing! Please note the extra escaped quotes around
Dest, as the default PostgreSQL behavior folds unquoted column names to lower case, which would result in a column
dest does not exist error. Creating an index and running the preceding query with much improved speed can be easily reproduced based on the MySQL example.
Oracle Database Express Edition can be downloaded and installed from http://www.oracle.com/technetwork/database/database-technologies/express-edition/downloads/index.html. Although this is not a full-featured Oracle database, and it suffers from serious limitations, the Express Edition is a free and not too resource-hungry way to build a test environment at home.
Oracle database is said to be the most popular database management system in the world, although it is available only with a proprietary license, unlike the previous two discussed RDBMSs, which means that Oracle offers the product with term licensing. On the other hand, the paid license also comes with priority support from the developer company, which is often a strict requirement in enterprise environments. Oracle Database has supported a variety of nice features since its first release in 1980, such as sharding, master-master replication, and full ACID properties.
Another way of getting a working Oracle database for testing purposes is to download an Oracle Pre-Built Developer VM from http://www.oracle.com/technetwork/community/developer-vm/index.html, or a much smaller image custom created for Hands-on Database Application Development at Oracle Technology Network Developer Day: http://www.oracle.com/technetwork/database/enterprise-edition/databaseappdev-vm-161299.html. We will follow the instructions from the latter source.
After accepting the License Agreement and registering for free at Oracle, we can download the
OTN_Developer_Day_VM.ova virtual appliance. Let's import it to VirtualBox via Import appliance in the File menu, then choose the
ova file, and click Next:
After importing has finished, we should first update the networking configuration so that we can access the internal database of the virtual machine from outside. So let's switch from NAT to Bridged Adapter in the settings:
Then we can simply start the newly created virtual machine in VirtualBox. After Oracle Linux has booted, we can log in with the default
Although we have set a bridged networking interface for our virtual machine, which means that the VM is directly connected to our real sub-network with a real IP address, the machine is not yet accessible over the network. To connect with the default DHCP settings, simply navigate to the top red bar and look for the networking icon, then select System eth0. After a few seconds the VM is accessible from your host machine, as the guest system should be connected to your network. You can verify that by running the
ip addr show eth0 command in the already running console:
Unfortunately, this already running Oracle database is not yet accessible outside the guest machine. The developer VM comes with a rather strict firewall by default, which should be disabled first. To see the rules in effect, run the standard
iptables -L -n command and, to flush all rules, execute
Now that we have a running and remotely accessible Oracle database, let's prepare the R client side. Installing the
ROracle package might get tricky on some operating systems, as there are no prebuilt binary packages and you have to manually install the Oracle Instant Client Lite and SDK libraries before compiling the package from source. If the compiler complained about the path of your previously installed Oracle libraries, please pass the
--with-oci-inc arguments with your custom paths with the
--configure-args parameter. More details can be found in the package installation document: http://cran.r-project.org/web/packages/ROracle/INSTALL.
For example, on Arch Linux you can install the Oracle libs from AUR, then run the following command in
bash after downloading the R package from CRAN:
# R CMD INSTALL --configure-args='--with-oci-lib=/usr/include/ \ > --with-oci-inc=/usr/share/licenses/oracle-instantclient-basic' \ > ROracle_1.1-11.tar.gz
After installing and loading the package, opening a connection is extremely similar to the pervious examples with
DBI::dbConnect. We only pass an extra parameter here. First, let us specify the hostname or direct IP address of the Oracle database included in the
dbname argument. Then we can connect to the already existing PDB1 database of the developer machine instead of the previously used
hflights_db—just to save some time and space in the book on slightly off-topic database management tasks:
> library(ROracle) Loading required package: DBI > con <- dbConnect(dbDriver('Oracle'), user = 'pmuser', + password = 'oracle', dbname = '//192.168.0.16:1521/PDB1')
And we have a working connection to Oracle RDBMS:
> summary(con) User name: pmuser Connect string: //192.168.0.16:1521/PDB1 Server version: 220.127.116.11.0 Server type: Oracle RDBMS Results processed: 0 OCI prefetch: FALSE Bulk read: 1000 Statement cache size: 0 Open results: 0
Let's see what we have in the bundled database on the development VM:
> dbListTables(con)  "TICKER_G" "TICKER_O" "TICKER_A" "TICKER"
So it seems that we have a table called
TICKER with three views on tick data of three symbols. Saving the
hflights table in the same database will not do any harm, and we can also instantly test the speed of the Oracle database when reading the whole table:
> dbWriteTable(con, 'hflights', hflights)  TRUE > system.time(dbReadTable(con, 'hflights')) user system elapsed 0.980 0.057 1.256
And the extremely familiar subset with 3,481 cases:
> system.time(dbGetQuery(con, + "SELECT * FROM \"hflights\" WHERE \"Dest\" = 'BNA'")) user system elapsed 0.046 0.003 0.131
Please note the quotes around the table name. In the previous examples with MySQL and PostgreSQL, the SQL statements run fine without those. However, the quotes are needed in the Oracle database, as we have saved the table with an all-lowercase name, and the default rule in Oracle DB is to store object names in upper case. The only other option is to use double quotes to create them, which is what we did; thus we have to refer to the table with quotes around the lowercase name.
We started with unquoted table and column names in MySQL, then had to add escaped quotes around the variable name in the PostgreSQL query run from R, and now in Oracle database we have to put both names between quotes—which demonstrates the slight differences in the various SQL flavors (such as MySQL, PostgreSQL, PL/SQL of Oracle or Microsoft's Transact-SQL) on top of ANSI SQL.
And more importantly: do not stick to one database engine with all your projects, but rather choose the optimal DB for the task if company policy doesn't stop you doing so.
These results were not so impressive compared to what we have seen by PostgreSQL, so let's also see the results of an indexed query:
> dbSendQuery(con, 'CREATE INDEX Dest_idx ON "hflights" ("Dest")') Statement: CREATE INDEX Dest_idx ON "hflights" ("Dest") Rows affected: 0 Row count: 0 Select statement: FALSE Statement completed: TRUE OCI prefetch: FALSE Bulk read: 1000 > system.time(dbGetQuery(con, "SELECT * FROM \"hflights\" + WHERE \"Dest\" = 'BNA'")) user system elapsed 0.023 0.000 0.069
I leave the full-scale comparative testing and benchmarking to you, so that you can run custom queries in the tests fitting your exact needs. It is highly possible that the different database engines perform differently in special use cases.
To make this process a bit more seamless and easier to implement, let's check out another R way of connecting to databases, although probably with a slight performance trade-off. For a quick scalability and performance comparison on connecting to Oracle databases with different approaches in R, please see https://blogs.oracle.com/R/entry/r_to_oracle_database_connectivity.
As mentioned earlier, installing the native client software, libraries, and header files for the different databases so that the custom R packages can be built from source can be tedious and rather tricky in some cases. Fortunately, we can also try to do the opposite of this process. An alternative solution can be installing a middleware Application Programming Interface (API) in the databases, so that R, or as a matter of fact any other tool, could communicate with them in a standardized and more convenient way. However, please be advised that this more convenient way impairs performance due to the translation layer between the application and the DBMS.
RODBC package implements access to such a layer. The Open Database Connectivity (ODBC) driver is available for most database management systems, even for CSV and Excel files, so
RODBC provides a standardized way to access data in almost any databases if the ODBC driver is installed. This platform-independent interface is available for SQLite, MySQL, MariaDB, PostgreSQL, Oracle database, Microsoft SQL Server, Microsoft Access, and IBM DB2 on Windows and on Linux.
Host name or address and port number, optionally a Unix socket
Optionally the username and password to be used for the connection
This can be done in the command line by editing the
odbcinst.ini files on Linux after installing the
unixODBC program. The latter should include the following configuration for the MySQL driver in your
[MySQL] Description = ODBC Driver for MySQL Driver = /usr/lib/libmyodbc.so Setup = /usr/lib/libodbcmyS.so FileUsage = 1
odbc.ini file includes the aforementioned DSN configuration for the exact database and server:
[hflights] Description = MySQL hflights test Driver = MySQL Server = localhost Database = hflights_db Port = 3306 Socket = /var/run/mysqld/mysqld.sock
Or use a graphical user interface on Mac OS or Windows, as shown in the following screenshot:
> library(RODBC) > con <- odbcConnect("hflights", uid = "user", pwd = "password")
Let's fetch the data we saved in the database before:
> system.time(hflights <- sqlQuery(con, "select * from hflights")) user system elapsed 3.180 0.000 3.398
Well, it took a few seconds to finish. That's the trade-off for using a more convenient and high-level interface to interact with the database. Removing and uploading data to the database can be done with similar high-level functions (such as
sqlFetch) besides the
odbc* functions, providing low-level access to the database. Quick examples:
> sqlDrop(con, 'hflights') > sqlSave(con, hflights, 'hflights')
You can use the exact same commands to query any of the other supported database engines; just be sure to set up the DSN for each backend, and to close your connections if not needed any more:
> library(dbConnect) Loading required package: RMySQL Loading required package: DBI Loading required package: gWidgets > DatabaseConnect() Loading required package: gWidgetsRGtk2 Loading required package: RGtk2
No parameters, no custom configuration in the console, just a simple dialog window:
After providing the required connection information, we can easily view the raw data and the column/variable types, and run custom SQL queries. A basic query builder can also help novice users to fetch subsamples from the database:
The package ships with a handy function called
sqlToR, which can turn the SQL results into R objects with a click in the GUI. Unfortunately,
dbConnect relies heavily on
RMySQL, which means it's a MySQL-only package, and there is no plan to extend the functionality of this interface.
For example, column-oriented database management systems, such as MonetDB, are often used to store large datasets with millions of rows and thousands of columns to provide the backend for high-performance data mining. It also has great R support with the
MonetDB.R package, which was among the most exciting talks at the useR! 2013 conference.
The ever-growing popularity of the NoSQL ecosystem also provides similar approaches, although usually without supporting SQL and providing a schema-free data storage. Apache Cassandra is a good example of such a similar, column-oriented, and primarily distributed database management system with high availably and performance, run on commodity hardware. The
RCassandra package provides access to the basic Cassandra features and the Cassandra Query Language in a convenient way with the
RC.* function family. Another Google Bigtable-inspired and similar database engine is HBase, which is supported by the
rhbase package, part of the
RHadoop project: https://github.com/RevolutionAnalytics/RHadoop/wiki.
Speaking of Massively Parallel Processing, HP's Vertica and Cloudera's open-source Impala are also accessible from R, so you can easily access and query large amount of data with relatively good performance.
One of the most popular NoSQL databases is MongoDB, which provides document-oriented data storage in a JSON-like format, providing an infrastructure to dynamic schemas. MongoDB is actively developed and has some SQL-like features, such as a query language and indexing, also with multiple R packages providing access to this backend. The
RMongo package uses the mongo-java-driver and thus depends on Java, but provides a rather high-level interface to the database. Another implementation, the
rmongodb package, is developed and maintained by the MongoDB Team. The latter has more frequent updates and more detailed documentation, but the R integration seems to be a lot more seamless with the first package as
rmongodb provides access to the raw MongoDB functions and BSON objects, instead of concentrating on a translation layer for general R users. A more recent and really promising package supporting MongoDB is
mongolite developed by Jeroen Ooms.
CouchDB, my personal favorite for most schema-less projects, provides very convenient document storage with JSON objects and HTTP API, which means that integrating in applications, such as any R script, is really easy with, for example, the
RCurl package, although you may find the
R4CouchDB more quick to act in interacting with the database.
Google BigQuery also provides a similar, REST-based HTTP API to query even terabytes of data hosted in the Google infrastructure with an SQL-like language. Although the
bigrquery package is not available on CRAN yet, you may easily install it from GitHub with the
devtools package from the same author, Hadley Wickham:
> library(devtools) > install_github('bigrquery', 'hadley')
To test-drive the features of this package and Google BigQuery, you can sign up for a free account to fetch and process the demo dataset provided by Google, respecting the 10,000 requests per day limitation for free usage. Please note that the current implementation is a read-only interface to the database.
For rather similar database engines and comparisons, see for example http://db-engines.com/en/systems. Most of the popular databases already have R support but, if not, I am pretty sure that someone is already working on it. It's worth checking the CRAN packages at http://cran.r-project.org/web/packages/available_packages_by_name.html or searching on GitHub or on http://R-bloggers.com to see how other R users manage to interact with your database of choice.
In a recent academic project, where my task was to implement some financial models in R, I got the demo dataset to be analyzed as Stata
dta files. Working as a contractor at the university, without access to any Stata installations, it might have been problematic to read the binary file format of another statistical software, but as the
dta file format is documented and the specification is publicly available at http://www.stata.com/help.cgi?dta, some members of the Core R Team have already implemented an R parser in the form of the
read.dta function in the
To this end, loading (and often writing) Stata—or for example SPSS, SAS, Weka, Minitab, Octave, or dBase files—just cannot be easier in R. Please see the complete list of supported file formats and examples in the package documentation or in the R Data Import/Export manual: http://cran.r-project.org/doc/manuals/r-release/R-data.html#Importing-from-other-statistical-systems.
One of the most popular file formats to store and transfer relatively small amounts of data in academic institutions and businesses (besides CSV files) is still Excel
xlsx, more recently). The first is a proprietary binary file format from Microsoft, which is exhaustively documented (the
xls specification is available in a document of more than 1,100 pages and 50 megabytes!), but importing multiple sheets, macros, and formulas is not straightforward even nowadays. This section will only cover the most used platform-independent packages to interact with Excel.
One option is to use the previously discussed
RODBC package with the Excel driver to query an Excel spreadsheet. Other ways of accessing Excel data depend on third-party tools, such as using Perl to automatically convert the Excel file to CSV then importing it into R as the
read.xls function from the
gdata package. But installing Perl on Windows sometimes seems to be tedious; thus,
RODBC might be a more convenient method on that platform.
Some platform-independent, Java-based solutions also provide a way to not just read, but also write Excel files, especially to the
xlsx, the Office Open XML file, format. Two separate implementations exist on CRAN to read and write Excel 2007 and the 97/2000/XP/2003 file formats: the
xlConnect and the
xlsx packages. Both are actively maintained, and use the Apache POI Java API project. This latter means that it runs on any platform that supports Java, and there is no need to have Microsoft Excel or Office on the computer; both packages can read and write Excel files on their own.
On the other hand, if you would rather not depend on Perl or Java, the recently published
openxlsx package provides a platform-independent (C++-powered) way of reading and writing
xlsx files. Hadley Wickham released a similar package, but with a slightly modified scope: the
readxl package can read (but not write) both the
xlsx file formats.
Remember: pick the most appropriate tool for your needs! For example to read Excel files without many external dependencies, I'd choose
readxl; but, for writing Excel 2003 spreadsheets with cell formatting and more advanced features, probably we cannot save the Java dependency and should use the
xlsx packages over the
This chapter focused on some rather boring, but important tasks that we usually do every day. Importing data is among the first steps of every data science projects, thus mastering data analysis should start with how to load data into the R session in an efficient way.
But efficiency is an ambiguous term in this sense: loading data should be quick in a technical point of view so as not to waste our time, although coding for long hours to speed up the importing process does not make much sense either.
The chapter gave a general overview on the most popular available options to read text files, to interact with databases, and to query subsets of data in R. Now you should be able to deal with all the most often used different data sources, and probably you can also choose which data source would be the ideal candidate in your projects and then do the benchmarks on your own, as we did previously.
The next chapter will extend this knowledge further by providing use cases for fetching data from the Web and different APIs. This simply means that you will be able to use public data in your projects, even if you do not yet have those in binary dataset files or on database backends.