You're reading from R for Data Science Cookbook

Product type: Book
Published in: Jul 2016
Reading level: Intermediate
ISBN-13: 9781784390815
Edition: 1st Edition
Author: Yu-Wei, Chiu (David Chiu)

Yu-Wei, Chiu (David Chiu) is the founder of LargitData (www.LargitData.com), a startup that focuses on big data and machine learning products. He previously worked at Trend Micro as a software engineer, where he was responsible for building big data platforms for business intelligence and customer relationship management systems. In addition to being a startup entrepreneur and data scientist, he specializes in using Spark and Hadoop to process big data and in applying data mining techniques for data analysis. Yu-Wei is also a professional lecturer who has delivered lectures on big data and machine learning in R and Python, and has given tech talks at a variety of conferences. In 2015, Yu-Wei wrote Machine Learning with R Cookbook (Packt Publishing), and in 2013 he reviewed Bioinformatics with R Cookbook (Packt Publishing). For more information, please visit his personal website at www.ywchiu.com.

Acknowledgement

I am immensely grateful to my family and friends for supporting and encouraging me to complete this book. I would like to sincerely thank my mother, Ming-Yang Huang (Miranda Huang); my mentor, Man-Kwan Shan; the proofreader of this book, Brendan Fisher; the members of LargitData; the Data Science Program (DSP); and other friends who have offered their support.

Chapter 2. Data Extracting, Transforming, and Loading

This chapter covers the following topics:

  • Downloading open data

  • Reading and writing CSV files

  • Scanning text files

  • Working with Excel files

  • Reading data from databases

  • Scraping web data

  • Accessing Facebook data

  • Working with twitteR

Introduction


Before using data to answer critical business questions, the most important step is to prepare it. Data is often archived in files that can easily be opened with Excel or a text editor. However, data may also live in a range of other sources, such as databases, websites, and various file formats, so being able to import data from all of these sources is crucial.

There are four main types of data. Data recorded in plain text is the simplest. For users who need a structured format, files with a .tab or .csv extension arrange data in a fixed number of columns. Excel, which has long had a leading role in data processing, uses the .xls and .xlsx formats. Knowing how to read and manipulate data from databases is another crucial skill. Moreover, since much data is not stored in any database at all, one must also know how to use web scraping techniques to obtain data from the Internet. As part of this chapter, we introduce...

Downloading open data


Before conducting any data analysis, an essential step is to collect high-quality, meaningful data. One important source is open data, which is selected, organized, and made freely available to the public. Most open data is published online, either as text files or through APIs. Here, we introduce how to download an open data file in text format with the download.file function.

Getting ready

In this recipe, you need to prepare your environment with R installed and a computer that can access the Internet.

How to do it…

Please perform the following steps to download open data from the Internet:

  1. First, visit the http://finance.yahoo.com/q/hp?s=%5EGSPC+Historical+Prices link to view the historical price of the S&P 500 in Yahoo Finance:

    Figure 1: Historical price of S&P 500

  2. Scroll down to the bottom of the page, right-click and copy the link in Download to Spreadsheet (the link should appear similar to http://real-chart.finance.yahoo.com/table.csv?s=%5EGSPC&d=6&e...
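Once the link is copied, the download step itself is a one-liner. The sketch below is hedged: the URL is a shortened placeholder for the link copied in the previous step, since the full query string varies.

```r
# Hedged sketch: download a CSV with download.file(). Replace the
# placeholder URL with the "Download to Spreadsheet" link copied above.
url <- "http://real-chart.finance.yahoo.com/table.csv?s=%5EGSPC"  # placeholder
download.file(url, destfile = "snp500.csv")
```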

Reading and writing CSV files


In the previous recipe, we downloaded the historical S&P 500 market index from Yahoo Finance. We can now read the data into an R session for further examination and manipulation. In this recipe, we demonstrate how to read a file with an R function.

Getting ready

In this recipe, you need to have followed the previous recipe by downloading the S&P 500 market index text file to the current directory.

How to do it…

Please perform the following steps to read text data from the CSV file.

  1. First, determine the current directory with getwd, and use list.files to check where the file is, as follows:

    > getwd()
    > list.files('./')
    
  2. You can then use the read.table function to read data by specifying the comma as the separator:

    > stock_data <- read.table('snp500.csv', sep=',' , header=TRUE)
    
  3. Next, filter data by selecting the first six rows with column Date, Open, High, Low, and Close:

    > subset_data <- stock_data[1:6, c("Date", "Open", "High", "Low", "Close...
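As a self-contained illustration of the read.table and subsetting pattern above (independent of the downloaded file), the following sketch writes a one-row CSV with the same column names and reads it back; the sample values are made up:

```r
# Write a tiny CSV mirroring the S&P 500 file's layout (values are
# made up), then read it back and keep only the price columns.
writeLines(c("Date,Open,High,Low,Close,Volume,Adj.Close",
             "2015-07-02,2073.95,2080.56,2073.95,2076.78,2996540000,2076.78"),
           "mini_snp.csv")
stock_data  <- read.table("mini_snp.csv", sep = ",", header = TRUE)
subset_data <- stock_data[, c("Date", "Open", "High", "Low", "Close")]
dim(subset_data)  # 1 row, 5 columns
```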

Scanning text files


In previous recipes, we introduced how to use read.table and read.csv to load data into an R session. However, these functions work best when the number of columns is fixed and the data size is small. For more flexibility in data processing, we will demonstrate how to use the scan function to read data from a file.

Getting ready

In this recipe, you need to have completed the previous recipes and have snp500.csv downloaded in the current directory.

How to do it…

Please perform the following steps to scan data from the CSV file:

  1. First, you can use the scan function to read data from snp500.csv:

    > stock_data3 <- scan('snp500.csv', sep=',', what=list(Date='', Open=0, High=0, Low=0, Close=0, Volume=0, Adj_Close=0), skip=1, fill=T)
    Read 16481 records
    
  2. You can then examine loaded data with mode and str:

    > mode(stock_data3)
    [1] "list"
    > str(stock_data3)
    List of 7
     $ Date     : chr [1:16481] "2015-07-02" "2015-07-01" "2015-06-30" "2015-06-29" ...
     $ Open...
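To see how the what = list(...) template drives parsing without needing the downloaded file, here is a small self-contained sketch with two made-up rows; each element's type ('' for character, 0 for numeric) tells scan how to read that column:

```r
# Self-contained scan() sketch: the what = list(...) template maps each
# column to a type ("" = character, 0 = numeric); skip = 1 skips the header.
writeLines(c("Date,Close", "2015-07-02,2076.78", "2015-07-01,2077.42"),
           "mini_scan.csv")
parsed <- scan("mini_scan.csv", sep = ",",
               what = list(Date = "", Close = 0), skip = 1)
parsed$Close  # c(2076.78, 2077.42)
```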

Working with Excel files


Excel is another popular tool used to store and analyze data. Of course, one can convert Excel files to CSV or other text formats by using Excel itself. Alternatively, to simplify the process, you can install and load the xlsx package to read and process Excel data directly in R.

Getting ready

In this recipe, you need to prepare your environment with R installed and a computer that can access the Internet.

How to do it…

Please perform the following steps to read Excel documents:

  1. First, install and load the xlsx package:

    > install.packages("xlsx")
    > library(xlsx)
    
  2. Access www.data.worldbank.org/topic/economy-and-growth to find world economy indicator data in Excel:

    Figure 6: World economy indicator

  3. Download world economy indicator data from the following URL using download.file:

    > download.file("http://api.worldbank.org/v2/en/topic/3?downloadformat=excel", "worldbank.xls", mode="wb")
    
  4. Examine the downloaded file with Excel (or Open Office):

    Figure 7: Using Excel to examine...
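After the download, reading the workbook into a data frame can be sketched as follows. Note that read.xlsx requires Java, and the startRow value is an assumption (World Bank sheets typically begin with a few metadata rows), so adjust it after inspecting the file:

```r
# Hedged sketch: read the first sheet of the downloaded workbook.
# sheetIndex and startRow are assumptions -- inspect the file first.
library(xlsx)
indicators <- read.xlsx("worldbank.xls", sheetIndex = 1, startRow = 4)
head(indicators)
```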

Reading data from databases


As R reads data into memory, it is well suited to processing and analyzing small datasets. However, as enterprises accumulate far more data than individuals do in daily life, databases have become the common choice for storing and analyzing larger datasets. To access databases from R, one can use RJDBC, RODBC, or RMySQL as the communication bridge. In this section, we will demonstrate how to use RJDBC to connect to data stored in a database.

Getting ready

In this section, we need to prepare a MySQL environment first. If you have a MySQL environment installed on your machine (Windows), you can inspect server status from MySQL Notifier. If the local server is running, the server status should prompt localhost (Online), as shown in the following screenshot:

Figure 8: MySQL Notifier

Once we have our database server online, we need to validate whether we are authorized to access the database with a given username and password by using any database connection...
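With the server online and credentials verified, a typical RJDBC session looks like the sketch below. Everything in it is a placeholder: the connector JAR path, the database name (test), and the credentials must all be adapted to your environment.

```r
# Hedged RJDBC sketch -- driver JAR path, database name, and
# credentials are placeholders to adapt.
library(RJDBC)
drv  <- JDBC("com.mysql.jdbc.Driver",
             "/path/to/mysql-connector-java.jar")
conn <- dbConnect(drv, "jdbc:mysql://localhost:3306/test",
                  "user", "password")
dbGetQuery(conn, "SELECT 1")   # sanity-check the connection
dbDisconnect(conn)
```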

Scraping web data


In most cases, the majority of data will not exist in your database, but will instead be published in different forms on the Internet. To dig up more valuable information from these data sources, we need to know how to access and scrape data from the Web. Here, we will illustrate how to use the rvest package to harvest finance data from http://www.bloomberg.com/.

Getting ready

In this recipe, you need to prepare your environment with R installed and a computer that can access the Internet.

How to do it…

Perform the following steps to scrape data from http://www.bloomberg.com/:

  1. First, access the following link to browse the S&P 500 index on the Bloomberg Business website: http://www.bloomberg.com/quote/SPX:IND:

    Figure 9: S&P 500 index

  2. Once the page appears, as shown in the preceding screenshot, we can begin installing and loading the rvest package:

    > install.packages("rvest")
    > library(rvest)
    
  3. Next, you can use the HTML function from the rvest package to scrape and...
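The scraping step can be sketched as below. The CSS selector is a placeholder: Bloomberg's markup changes over time, so inspect the page (for example with your browser's developer tools) to find the element that holds the quote.

```r
# Hedged rvest sketch -- ".price" is a placeholder selector; inspect
# the page to find the real one before running.
library(rvest)
page  <- read_html("http://www.bloomberg.com/quote/SPX:IND")
price <- page %>% html_nodes(".price") %>% html_text()
price
```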

Accessing Facebook data


Social network data is another great source for users interested in exploring and analyzing social interactions. The main difference between social network data and web data is that social network platforms often provide a semi-structured data format (mostly JSON). Thus, one can easily access the data without needing to inspect how it is structured. In this recipe, we will illustrate how to use rvest and rjson to read and parse data from Facebook.

Getting ready

In this recipe, you need to prepare your environment with R installed and a computer that can access the Internet.

How to do it…

Perform the following steps to access data from Facebook:

  1. First, we need to log in to Facebook and access the developer page (https://developers.facebook.com/):

    Figure 18: Accessing the Facebook developer page

  2. Click on Tools & Support, and select Graph API Explorer:

    Figure 19: Selecting the Graph API Explorer

  3. Next, click on Get Token, and choose Get Access Token:

    Figure...
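Once a token is obtained from the Graph API Explorer, a request can be sketched as below; <token> is a placeholder, and the exact shape of the Graph API response depends on the API version and the permissions granted:

```r
# Hedged Graph API sketch -- <token> is a placeholder for the access
# token copied from the Graph API Explorer.
library(rjson)
url <- "https://graph.facebook.com/me?access_token=<token>"
me  <- fromJSON(paste(readLines(url, warn = FALSE), collapse = ""))
me$name
```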

Working with twitteR


In addition to obtaining social network interaction data, one can collect millions of tweets from Twitter for further text mining tasks. The method for retrieving data from Twitter is very similar to that for Facebook: for both platforms, all we need is an access token. Once we have retrieved the access token, we can use twitteR to access millions of tweets.

Getting ready

In this recipe, you need to prepare your environment with R installed and a computer that can access the Internet.

How to do it…

Perform the following steps to read data from Twitter:

  1. First, you need to log in to Twitter and access the page of Twitter Apps at https://apps.twitter.com/. Click on Create New App:

    Figure 26: Creating a new Twitter app

  2. Fill in all required application details to create a new application:

    Figure 27: Filling in the required details

  3. Next, you can select Keys and Access Tokens and then access Application Settings:

    Figure 28: Copying API key and secret

  4. Click on...
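After the credentials are created, authentication and a first search can be sketched as below. The four credential strings are placeholders copied from your app's Keys and Access Tokens page, and #rstats is an arbitrary example query:

```r
# Hedged twitteR sketch -- all four credential strings are placeholders.
library(twitteR)
setup_twitter_oauth("API_KEY", "API_SECRET",
                    "ACCESS_TOKEN", "ACCESS_SECRET")
tweets <- searchTwitter("#rstats", n = 100)   # example query
head(twListToDF(tweets))                      # tweets as a data frame
```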
