Reader small image

You're reading from  Big Data Analytics with R

Product typeBook
Published inJul 2016
Reading LevelBeginner
PublisherPackt
ISBN-139781786466457
Edition1st Edition
Languages
Concepts
Right arrow
Author (1)
Simon Walkowiak
Simon Walkowiak
author image
Simon Walkowiak

Simon Walkowiak is a cognitive neuroscientist and a managing director of Mind Project Ltd a Big Data and Predictive Analytics consultancy based in London, United Kingdom. As a former data curator at the UK Data Service (UKDS, University of Essex) European largest socio-economic data repository, Simon has an extensive experience in processing and managing large-scale datasets such as censuses, sensor and smart meter data, telecommunication data and well-known governmental and social surveys such as the British Social Attitudes survey, Labour Force surveys, Understanding Society, National Travel survey, and many other socio-economic datasets collected and deposited by Eurostat, World Bank, Office for National Statistics, Department of Transport, NatCen and International Energy Agency, to mention just a few. Simon has delivered numerous data science and R training courses at public institutions and international companies. He has also taught a course in Big Data Methods in R at major UK universities and at the prestigious Big Data and Analytics Summer School organized by the Institute of Analytics and Data Science (IADS).
Read more about Simon Walkowiak

Right arrow

Chapter 5. R with Relational Database Management Systems (RDBMSs)

We have already done quite a lot of Big Data analytics using the R language in the preceding chapters, but this book would not be complete if we didn't touch on the subject of databases. To be precise, in this chapter we will explore connectivity between R and very popular Relational Database Management Systems (RDBMSs), more commonly known as SQL databases. After reading the contents of this chapter, you will know how:

  • To set up a number of local and/or remote SQL databases, for example SQLite, PostgreSQL, and MariaDB/MySQL

  • To query and manage SQL databases directly from R (both locally and remotely) using a selection of R packages

  • To launch fully-managed, highly-scalable Amazon RDS database instances (of different types) and query their records using the R language

During the process of achieving the preceding goals, you will also be exposed to a variety of methods and techniques that will help you with the installation of specific...

Relational Database Management Systems (RDBMSs)


The abundance of  RDBMSs currently available means that it's nearly impossible to describe all or at least a large majority of them in one single chapter. If you haven't worked with any such databases in your analytical or research career, now is the best time to explore how they can benefit your Big Data processing and management activities.

A short overview of used RDBMSs

In order to give you a taste of the variety of databases available to R users, we decided to present three of them, which can be launched and connected from R in three different scenarios:

  • Locally on a personal computer

  • Locally on a virtual machine

  • Remotely with a database on a server and RStudio installed on a personal local machine

Our selection criteria also included the requirements that all databases are open-source or at least free to use, are well-maintained with an active community of users, and can operate on multiple platforms (at least on Mac OS X, Windows, and Linux...

SQLite with R


In this part of the chapter, we will query a SQLite database installed on a local, personal computer directly from RStudio. But before we can do it, follow the next section to prepare a SQLite database and read the data in.

Preparing and importing data into a local SQLite database

We mentioned earlier that SQLite is, by default, included in some distributions of popular operating systems, for example Mac OS X (since version 10.4) and in Windows 10. You can easily check whether your machine has SQLite installed by starting it through a Terminal/shell window:

$ sqlite3
SQLite version 3.12.1 2016-04-08 15:09:49
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite>

If the command produces the preceding output (or similar) your machine is already equipped with SQLite database. If for some reason your operating system does not contain SQLite, visit http://www.sqlite.org/download.html to download...

MariaDB with R on a Amazon EC2 instance


In Online Chapter, Pushing R Further (https://www.packtpub.com/sites/default/files/downloads/5396_6457OS_PushingRFurther.pdf), apart from creating a Linux Ubuntu virtual machine with RStudio Server on Microsoft Azure, we have also launched an Amazon Linux EC2 instance. In this section we will deploy an Ubuntu instance with RStudio Server, but this time on Amazon EC2.

Preparing the EC2 instance and RStudio Server for use

The good news is that you should already know how to do this. Initially, simply follow steps 1 through 13 of the section on Creating your first Amazon EC2 instance from Online Chapter, Pushing R Further (https://www.packtpub.com/sites/default/files/downloads/5396_6457OS_PushingRFurther.pdf). Choose a Free Usage Tier Linux Ubuntu instance and (in step 6) create your new key pair with a distinct name, for example rstudio_mariadb.pem. Also in the same step (step 6), add another custom TCP rule for port 3306. This will allow connectivity...

PostgreSQL with R on Amazon RDS


The methods for launching various open-source SQL databases described in the preceding two sections present just one way of connecting R to data stored in these databases. An alternative approach is offered by Amazon RDS - a managed and highly-scalable solution for database management. In fact, Amazon RDS is probably the easiest and fastest method of deploying a fully-operational SQL database, as it requires very minimal input from users. It also allows support for remote connectivity from RStudio, ensuring that users can quickly connect to the database in the cloud from the comfort of their local computer.

Launching an Amazon RDS database instance

Setting up and launching the database instance on Amazon RDS is generally very user friendly. The following instructions will guide you through the process and will let you create a small, Free Usage Tier (as of April 2016) t2.micro RDS instance with a PostgreSQL database ready for use:

Summary


We began this chapter with a very gentle introduction to Relational Database Management Systems and the basics of Structured Query Language, in order to equip you with the essential skills required to manage RDBMSs on your own.

We then moved on to practical exercises that let you explore a number of techniques of connecting R with relational databases. We first presented how to query and process data locally using a SQLite database, then we thoroughly covered connectivity with MariaDB (and also MySQL, as both are very similar) installed on an Amazon Elastic Cloud Computing instance, and finally we remotely analyzed the data stored and managed in the PostgreSQL database through the Amazon Relational Database Service instance.

Throughout the sections and tutorials of this chapter, you have learned that R can be conveniently used as a tool for the processing and analysis of large, out-of-memory collection of data stored in traditional SQL-operated databases.

In the next chapter, we will...

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Big Data Analytics with R
Published in: Jul 2016Publisher: PacktISBN-13: 9781786466457
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
Simon Walkowiak

Simon Walkowiak is a cognitive neuroscientist and a managing director of Mind Project Ltd a Big Data and Predictive Analytics consultancy based in London, United Kingdom. As a former data curator at the UK Data Service (UKDS, University of Essex) European largest socio-economic data repository, Simon has an extensive experience in processing and managing large-scale datasets such as censuses, sensor and smart meter data, telecommunication data and well-known governmental and social surveys such as the British Social Attitudes survey, Labour Force surveys, Understanding Society, National Travel survey, and many other socio-economic datasets collected and deposited by Eurostat, World Bank, Office for National Statistics, Department of Transport, NatCen and International Energy Agency, to mention just a few. Simon has delivered numerous data science and R training courses at public institutions and international companies. He has also taught a course in Big Data Methods in R at major UK universities and at the prestigious Big Data and Analytics Summer School organized by the Institute of Analytics and Data Science (IADS).
Read more about Simon Walkowiak