You're reading from R for Data Science Cookbook (n)
Before using data to answer critical business questions, the most important thing is to prepare it. Data is normally archived in files that can be opened easily with Excel or a text editor. However, data can also be located in a range of different sources, such as databases, websites, and various file formats. Being able to import data from these sources is crucial.
There are four main types of data. Data recorded in text format is the simplest. As some users need to store data in a structured format, files with a .tab or .csv extension can be used to arrange data in a fixed number of columns. For many years, Excel has had a leading role in the field of data processing, and this software uses the .xls and .xlsx formats. Knowing how to read and manipulate data from databases is another crucial skill. Moreover, as most data is not stored in a database, one must know how to use the web scraping technique to obtain data from the Internet. As part of this chapter, we introduce...
Before conducting any data analysis, an essential step is to collect high-quality, meaningful data. One important data source is open data, which is selected, organized, and freely available to the public. Most open data is published online in either text format or as APIs. Here, we introduce how to download the text format of an open data file with the download.file function.
In this recipe, you need to prepare your environment with R installed and a computer that can access the Internet.
Please perform the following steps to download open data from the Internet:
First, visit the http://finance.yahoo.com/q/hp?s=%5EGSPC+Historical+Prices link to view the historical price of the S&P 500 in Yahoo Finance:
Scroll down to the bottom of the page, right-click and copy the link in Download to Spreadsheet (the link should appear similar to http://real-chart.finance.yahoo.com/table.csv?s=%5EGSPC&d=6&e...
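As a minimal sketch of the download step: the spreadsheet link above is truncated, so the example below does not reproduce it. Instead it builds a small local file with made-up rows and fetches it through a file:// URL, which download.file also accepts. The variable names (src, csv_url, dest) are illustrative assumptions; in practice csv_url would hold the link copied from the page.

```r
# Stand-in for the remote CSV: a small local file with made-up rows.
src <- tempfile(fileext = ".csv")
writeLines(c("Date,Open,Close",
             "2015-07-02,100.5,101.0",
             "2015-07-01,99.0,100.5"), src)

# Hypothetical URL; replace it with the link copied from the download page.
# (Assumes a Unix-like absolute path; on Windows the prefix would be "file:///".)
csv_url <- paste0("file://", src)

# Save the file under the name used in the following recipes.
dest <- file.path(tempdir(), "snp500.csv")

# download.file() returns 0 (invisibly) when the transfer succeeds.
status <- download.file(csv_url, destfile = dest, quiet = TRUE)

file.exists(dest)
```

For binary formats such as Excel workbooks, passing mode = "wb" keeps the file intact on Windows.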
In the previous recipe, we downloaded the historical S&P 500 market index from Yahoo Finance. We can now read the data into an R session for further examination and manipulation. In this recipe, we demonstrate how to read a file with an R function.
In this recipe, you need to have followed the previous recipe by downloading the S&P 500 market index text file to the current directory.
Please perform the following steps to read text data from the CSV file.
First, determine the current directory with getwd, and use list.files to check where the file is, as follows:
> getwd()
> list.files('./')
You can then use the read.table function to read data by specifying the comma as the separator:
> stock_data <- read.table('snp500.csv', sep=',', header=TRUE)
Next, filter data by selecting the first six rows with the columns Date, Open, High, Low, and Close:
> subset_data <- stock_data[1:6, c("Date", "Open", "High", "Low", "Close...
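The read-and-subset steps can be tried without the downloaded file. The sketch below writes a tiny CSV with made-up values (not real S&P 500 prices) to a temporary path, reads it with read.table, and selects rows and columns by name. The csv_path variable and the sample rows are assumptions for illustration only.

```r
# Write a toy CSV with made-up values so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Date,Open,High,Low,Close,Volume",
  "2015-07-02,10,12,9,11,1000",
  "2015-07-01,11,13,10,12,1100",
  "2015-06-30,12,14,11,13,1200"
), csv_path)

# Read with an explicit separator and a header row.
stock_data <- read.table(csv_path, sep = ",", header = TRUE,
                         stringsAsFactors = FALSE)

# read.csv() wraps read.table() with header = TRUE and sep = ","
# among its defaults, so it gives the same result here.
stock_data2 <- read.csv(csv_path, stringsAsFactors = FALSE)

# Select the first two rows and a subset of columns by name.
subset_data <- stock_data[1:2, c("Date", "Open", "High", "Low", "Close")]
subset_data
```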
In previous recipes, we introduced how to use read.table and read.csv to load data into an R session. However, read.table and read.csv only work if the number of columns is fixed and the data size is small. To be more flexible in data processing, we will demonstrate how to use the scan function to read data from the file.
In this recipe, you need to have completed the previous recipes and have snp500.csv downloaded in the current directory.
Please perform the following steps to scan data from the CSV file:
First, you can use the scan function to read data from snp500.csv:
> stock_data3 <- scan('snp500.csv', sep=',', what=list(Date = '', Open = 0, High = 0, Low = 0, Close = 0, Volume = 0, Adj_Close = 0), skip=1, fill=TRUE)
Read 16481 records
You can then examine loaded data with mode and str:
> mode(stock_data3)
[1] "list"
> str(stock_data3)
List of 7
 $ Date : chr [1:16481] "2015-07-02" "2015-07-01" "2015-06-30" "2015-06-29" ...
 $ Open...
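To see how the what argument shapes the result, here is a small sketch of the same scan pattern on a toy file. The values are made up and the three-column layout is an assumption chosen for brevity. Each element of what sets both the name and the type of one field, and scan returns a list with one vector per field.

```r
# Toy CSV with made-up values; the header line is skipped by scan().
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Date,Open,Close",
  "2015-07-02,10,11",
  "2015-07-01,11,12"
), csv_path)

# '' marks a character field and 0 marks a numeric field.
cols <- scan(csv_path, sep = ",", skip = 1, quiet = TRUE,
             what = list(Date = "", Open = 0, Close = 0))

mode(cols)  # "list"

# The list of equal-length vectors converts directly to a data frame.
stock_df <- as.data.frame(cols, stringsAsFactors = FALSE)
stock_df
```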
Excel is another popular tool used to store and analyze data. Of course, one can convert Excel files to CSV files or other text formats by using Excel. Alternatively, to simplify the process, you can install and load the xlsx package to read and process Excel data in R.
In this recipe, you need to prepare your environment with R installed and a computer that can access the Internet.
Please perform the following steps to read Excel documents:
First, install and load the xlsx package:
> install.packages("xlsx")
> library(xlsx)
Access www.data.worldbank.org/topic/economy-and-growth to find world economy indicator data in Excel:
Download world economy indicator data from the following URL using download.file:
> download.file("http://api.worldbank.org/v2/en/topic/3?downloadformat=excel", "worldbank.xls", mode="wb")
Examine the downloaded file with Excel (or Open Office):
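Reading a workbook into R might look like the following sketch. It does not use the World Bank file. It writes a toy workbook with made-up figures and reads the first sheet back with read.xlsx. The data frame contents and the xlsx_path variable are assumptions, and the code runs only if the xlsx package (which depends on Java through rJava) is available.

```r
if (requireNamespace("xlsx", quietly = TRUE)) {
  # Toy data with made-up values, written out as a workbook.
  toy <- data.frame(country = c("A", "B"), gdp_growth = c(1.5, 2.0))
  xlsx_path <- tempfile(fileext = ".xlsx")
  xlsx::write.xlsx(toy, xlsx_path, row.names = FALSE)

  # Read the first worksheet back into a data frame.
  sheet1 <- xlsx::read.xlsx(xlsx_path, sheetIndex = 1)
  str(sheet1)
} else {
  message("Package 'xlsx' (and a Java runtime) is required for this sketch.")
}
```

For the downloaded worldbank.xls file, the same read.xlsx call would take that file name, with sheetIndex chosen after inspecting the workbook.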
As R reads data into memory, it is well suited to processing and analyzing small datasets. However, as an enterprise accumulates much more data than individuals do in their daily lives, databases are becoming more common for storing and analyzing larger datasets. To access databases with R, one can use RJDBC, RODBC, or RMySQL as the communications bridge. In this section, we will demonstrate how to use RJDBC to connect to data stored in the database.
In this section, we need to prepare a MySQL environment first. If you have a MySQL environment installed on your machine (Windows), you can inspect server status from MySQL Notifier. If the local server is running, the server status should prompt localhost (Online), as shown in the following screenshot:
Once we have our database server online, we need to validate whether we are authorized to access the database with a given username and password by using any database connection...
In most cases, the majority of data will not exist in your database, but will instead be published in different forms on the Internet. To dig up more valuable information from these data sources, we need to know how to access and scrape data from the Web. Here, we will illustrate how to use the rvest package to harvest finance data from http://www.bloomberg.com/.
In this recipe, you need to prepare your environment with R installed and a computer that can access the Internet.
Perform the following steps to scrape data from http://www.bloomberg.com/:
First, access the following link to browse the S&P 500 index on the Bloomberg Business website: http://www.bloomberg.com/quote/SPX:IND
Once the page appears, as shown in the preceding screenshot, we can begin installing and loading the rvest package:
> install.packages("rvest")
> library(rvest)
Next, you can use the HTML function from the rvest package to scrape and...
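The general scraping pattern (parse a page, select nodes with a CSS selector, extract their text) can be sketched offline. The HTML snippet, the price class name, and the number below are made up for illustration and do not reflect Bloomberg's actual markup. Note that recent rvest releases use read_html in place of the older html function. The code runs only if rvest is installed.

```r
if (requireNamespace("rvest", quietly = TRUE)) {
  # Made-up page source standing in for a downloaded quote page.
  page_src <- paste0(
    '<html><body>',
    '<div class="name">Index</div>',
    '<div class="price">2,076.78</div>',
    '</body></html>'
  )

  # Parse the HTML, then pick nodes with a CSS selector.
  page <- rvest::read_html(page_src)
  price_node <- rvest::html_nodes(page, "div.price")

  # Extract the text and convert it to a number.
  price_txt <- rvest::html_text(price_node)
  price <- as.numeric(gsub(",", "", price_txt))
  price
}
```

Against a live site, read_html would take the page URL instead of a string, and the selector would have to be found by inspecting the page.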
Social network data is another great source for the user who is interested in exploring and analyzing social interactions. The main difference between social network data and web data is that social network platforms often provide a semi-structured data format (mostly JSON). Thus, one can easily access the data without the need to inspect how the data is structured. In this recipe, we will illustrate how to use rvest and rjson to read and parse data from Facebook.
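To illustrate why JSON is convenient to work with, the sketch below parses a small JSON string into nested R lists. It assumes the JSON parser meant here is the rjson package. The payload is made up and is not the actual response shape of the Facebook Graph API. The code runs only if rjson is installed.

```r
if (requireNamespace("rjson", quietly = TRUE)) {
  # Made-up JSON payload with a list of two records.
  json_txt <- paste0(
    '{"data": [',
    '{"id": "1", "message": "hello"},',
    '{"id": "2", "message": "world"}',
    ']}'
  )

  # JSON objects become named lists; arrays of objects become lists.
  parsed <- rjson::fromJSON(json_txt)

  # Pull one field out of every record.
  messages <- vapply(parsed$data, function(post) post$message, character(1))
  messages
}
```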
In this recipe, you need to prepare your environment with R installed and a computer that can access the Internet.
Perform the following steps to access data from Facebook:
First, we need to log in to Facebook and access the developer page (https://developers.facebook.com/):
Click on Tools & Support, and select Graph API Explorer:
Next, click on Get Token, and choose Get Access Token:
In addition to obtaining social network interaction data, one can collect millions of tweets from Twitter for further text mining tasks. The method for retrieving data from Twitter is very similar to that for Facebook. For both social platforms, all we need is an access token to access insight data. After we have retrieved the access token, we can then use twitteR to access millions of tweets.
In this recipe, you need to prepare your environment with R installed and a computer that can access the Internet.
Perform the following steps to read data from Twitter:
First, you need to log in to Twitter and access the page of Twitter Apps at https://apps.twitter.com/. Click on Create New App:
Fill in all required application details to create a new application:
Next, you can select Keys and Access Tokens and then access Application Settings:
Click on...