Data Extracting, Transforming, and Loading

In this article, by Yu-Wei, Chiu, author of the book, R for Data Science Cookbook, covers the following topics:

  • Scraping web data
  • Accessing Facebook data

(For more resources related to this topic, see here.)

Before using data to answer critical business questions, the most important thing is to prepare it. Data is normally archived in files, and using Excel or text editors allows it to be easily obtained. However, data can be located in a range of different sources, such as databases, websites, and various file formats. Being able to import data from these sources is crucial.

There are four main types of data. Data recorded in a text format is the most simple. As some users require storing data in a structured format, files with a .tab or .csv extension can be used to arrange data in a fixed number of columns. For many years, Excel has held a leading role in the field of data processing, and this software uses the .xls and .xlsx formats. Knowing how to read and manipulate data from databases is another crucial skill. Moreover, as most data is not stored in a database, we must know how to use the web scraping technique to obtain data from the internet. As part of this chapter, we will introduce how to scrape data from the internet using the rvest package.

Many experienced developers have already created packages to allow beginners to obtain data more easily, and we focus on leveraging these packages to perform data extraction, transformation, and loading. In this chapter, we will first learn how to utilize R packages to read data from a text format and scan files line by line. We then move to the topic of reading structured data from databases and Excel. Finally, we will learn how to scrape internet and social network data using the R web scraper.

Scraping web data

In most cases, the majority of data will not exist in your database, but it will instead be published in different forms on the internet. To dig more valuable information from these data sources, we need to know how to access and scrape data from the Web. Here, we will illustrate how to use the rvest package to harvest finance data from http://www.bloomberg.com/.

Getting ready

For this recipe, prepare your environment with R installed on a computer with internet access.

How to do it...

Perform the following steps to scrape data from http://www.bloomberg.com/:

  1. First, access the following link to browse the S&P 500 index on the Bloomberg Business website http://www.bloomberg.com/quote/SPX:IND:

    R for Data Science Cookbook

  2. Once the page appears as shown in the preceding screenshot, we can begin installing and loading the rvest package:
    >  install.packages("rvest")
    >  library(rvest)
  3. Next, you can use the HTML function from rvest package to scrape and parse the HTML page of the link to the S&P 500 index at http://www.bloomberg.com/:

    > spx_quote <- html("http://www.bloomberg.com/quote/SPX:IND")
  4. Use the browser's built-in web inspector to inspect the location of detail quote (marked with a red rectangle) below the index chart:

    R for Data Science Cookbook

  5. You can then move the mouse over the detail quote and click on the target element that you wish to scrape down. As the following screenshot shows, the <div class="cell"> section holds all the information that we need:

    R for Data Science Cookbook

  6. Extract elements with the class of cell using the html_nodes function:
    > cell  <- spx_quote %>% html_nodes(".cell")
  7. Furthermore, we can parse the label of the detailed quote from elements with the class of cell__label, extract text from scraped HTML, and eventually clean spaces and newline characters from the extracted text:
    > label <- cell %>% 
    +     html_nodes(".cell__label")  %>% 
    +     html_text() %>%  
    +     lapply(function(e) gsub("\n|\\s+", "", e))
  8. We can also extract the value of detailed quote from the element with the class of cell__value, extract text from scraped HTML, as well as clean spaces and newline characters:
    > value <- cell %>%  
    +     html_nodes(".cell__value") %>% 
    +     html_text() %>%  
    +     lapply(function(e)gsub("\n|\\s+", "", e))
  9. Finally, we can set the extracted label as the name to value:

    > names(value) <- title
  10. Next, we can access the energy and oil market index page at this link (http://www.bloomberg.com/energy), as shown in the following screenshot:

    R for Data Science Cookbook

  11. We can then use the web inspector to inspect the location of the table element:

    R for Data Science Cookbook

  12. Finally, we can use html_table to extract the table element with class of data-table:
    > energy <- html("http://www.bloomberg.com/energy")
    > energy.table <- energy %>% 
      html_node(".data-table") %>% html_table()

How it works...

The most difficult step in scraping data from a website is that web data is published and structured in different formats. We have to fully understand how data is structured within the HTML tag before continuing.

As HTML (Hypertext Markup Language) is a language that has similar syntax to XML, we can use the XML package to read and parse HTML pages. However, the XML package only provides the XPath method, which has two main shortcomings, as follows:

  • Inconsistent behavior in different browsers
  • It is hard to read and maintain

For these reasons, we recommend using the CSS selector over XPath when parsing HTML.

Python users may be familiar with how to scrape data quickly using requests and the BeautifulSoup packages. The rvest package is the counterpart package in R, which provides the same capability to simply and efficiently harvest data from HTML pages.

In this recipe, our target is to scrape the finance data of the S&P 500 detail quote from http://www.bloomberg.com/. Our first step is to make sure that we can access our target webpage through the internet, which is followed by installing and loading the rvest package. After installation and loading is complete, we can then use the HTML function to read the source code of the page to spx_quote.

Once we have confirmed that we can read the HTML page, we can start parsing the detailed quote from the scraped HTML. However, we first need to inspect the CSS path of the detail quote. There are many ways to inspect the CSS path of a specific element. The most popular method is to use the development tool built into each browser (press F12 or FN + F12) to inspect the CSS path. Using Google Chrome as an example, you can open the development tool by pressing F12. A DevTools window may show up somewhere in the visual area (you may refer to https://developer.chrome.com/devtools/docs/dom-and-styles#inspecting-ele...).

Then, you can move the mouse cursor to the upper left of the DevTools window and select the Inspect Element icon (a magnifier icon similar to ). Next, click on the target element, and the DevTools window will highlight the source code of the selected area. You can then move the mouse cursor to the highlighted area and right-click on it. From the pop-up menu, click on Copy CSS Path to extract the CSS path. Or, you can examine the source code and find that the selected element is structured in HTML code with the class of cell.

One highlight of rvest is that it is designed to work with magrittr, so that we can use a %>% pipelines operator to chain output parsed at each stage. Thus, we can first obtain the output source by calling spx_quote and then pipe the output to html_nodes. As the html_nodes function uses CSS selector to parse elements, the function takes basic selectors with type (for example, div), ID (for example, #header), and class (for example, .cell). As the elements to be extracted have the class of cell, you should place a period (.) in front of cell.

Finally, we should extract both label and value from previously parsed nodes. Here, we first extract the element of class cell__label, and we then use html_text to extract text. We can then use the gsub function to clean spaces and newline characters from the parsed text. Likewise, we apply the same pipeline to extract the element of the class__value class. As we extracted both label and value from detail quote, we can apply the label as the name to the extracted values. We have now organized data from the web to structured data.

Alternatively, we can also use rvest to harvest tabular data. Similarly to the process used to harvest the S&P 500 index, we can first access the energy and oil market index page. We can then use the web element inspector to find the element location of table data. As we have found the element located in the class of data-table, we can use the html_table function to read the table content into an R data frame.

There's more...

Instead of using the web inspector built into each browser, we can consider using SelectorGadget (http://selectorgadget.com/) to search for the CSS path. SelectorGadget is a very powerful and simple to use extension for Google Chrome, which enables the user to extract the CSS path of the target element with only a few clicks:

  1. To begin using SelectorGadget, access this link (https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjic...). Then, click on the green button (circled in the red rectangle as shown in the following screenshot) to install the plugin to Chrome:

    R for Data Science Cookbook

  2. Next, click on the upper-right icon to open SelectorGadget, and then select the area which needs to be scraped down. The selected area will be colored green, and the gadget will display the CSS path of the area and the number of elements matched to the path:

    R for Data Science Cookbook

  3. Finally, you can paste the extracted CSS path to html_nodes as an input argument to parse the data.

Besides rvest, we can connect R to Selenium via Rselenium to scrape the web page. Selenium was originally designed as an automating web application that enables the user to command a web browser to automate processes through simple scripts. However, we can also use Selenium to scrape data from the Internet. The following instruction presents a sample demo on how to scrape Bloomberg.com using Rselenium:

  1. First, access this link to download the Selenium standalone server (http://www.seleniumhq.org/download/), as shown in the following screenshot:

    R for Data Science Cookbook

  2. Next, start the Selenium standalone server using the following command:
    $ java -jar selenium-server-standalone-2.46.0.jar

    If you can successfully launch the standalone server, you should see the following message, which means that you can connect to the server that binds to port 4444:

    R for Data Science Cookbook

  3. At this point, you can begin installing and loading RSelenium with the following command:
    > install.packages("RSelenium")
    > library(RSelenium)
  4. After RSelenium is installed, register the driver and connect to the Selenium server:

    > remDr <- remoteDriver(remoteServerAddr = "localhost" 
    +                      , port = 4444
    +                      , browserName = "firefox"
    +)
    
  5. Examine the status of the registered driver:

    > remDr$getStatus()
  6. Next, we navigate to Bloomberg.com:
    > remDr$open()
    > remDr$navigate("http://www.bloomberg.com/quote/SPX:IND ")
    
  7. Finally, we can scrape the data using the CSS selector.
    > webElem <- remDr$findElements('css selector', ".cell") 
    > webData <- sapply(webElem, function(x){
    +   label <- x$findChildElement('css selector', '.cell__label')
    +   value <- x$findChildElement('css selector', '.cell__value')
    +   cbind(c("label" = label$getElementText(), "value" = value$getElementText()))
    + }
    + )

Accessing Facebook data

Social network data is another great source for a user who is interested in exploring and analyzing social interactions. The main difference between social network data and web data is that social network platforms often provide a semi-structured data format (mostly JSON). Thus, we can easily access the data without the need to inspect how the data is structured. In this recipe, we will illustrate how to use rvest and rson to read and parse data from Facebook.

Getting ready

For this recipe, prepare your environment with R installed on a computer with Internet access.

How to do it…

Perform the following steps to access data from Facebook:

  1. First, we need to log in to Facebook and access the developer page (https://developers.facebook.com/), as shown in the following screenshot:

    R for Data Science Cookbook

  2. Click on Tools & Support and select Graph API Explorer:

    R for Data Science Cookbook

  3. Next, click on Get Token and choose Get Access Token:

    R for Data Science Cookbook

  4. On the User Data Permissions pane, select user_tagged_places and then click on Get Access Token:

    R for Data Science Cookbook

  5. Copy the generated access token to the clipboard:

    R for Data Science Cookbook

  6. Try to access Facebook API using rvest:
    > access_token <- '<access_token>'
    > fb_data <- html(sprintf("https://graph.facebook.com/me/tagged_places?access_token=%s",access_token))
  7. Install and load rjson package:
    > install.packages("rjson")
    > library(rjson)
  8. Extract the text from fb_data and then use fromJSON to read JSON data:

    > fb_json <-  fromJSON(fb_data %>% html_text())
  9. Use sapply to extract the name and ID of the place from fb_json:
    > fb_place <- sapply(fb_json$data, function(e){e$place$name})
    > fb_id <- sapply(fb_json$data, function(e){e$place$id})
  10. Last, use data.frame to wrap the data:

    > data.frame(place = fb_place, id = fb_id)

How it works…

In this recipe, we covered how to retrieve social network data through Facebook's Graph API. Unlike scraping web pages, you need to obtain a Facebook access token before making any request for insight information. There are two ways to retrieve the access token: the first is to use Facebook's Graph API Explorer, and the other is to create a Facebook application. In this recipe, we illustrated how to use the Graph API Explorer to obtain the access token.

Facebook's Graph API Explorer is where you can craft your requests URL to access Facebook data on your behalf. To access the explorer page, we first visit Facebook's developer page (https://developers.facebook.com/). The Graph API Explorer page is under the drop-down menu of Tools & Support. After entering the explorer page, we select Get Access Token from the drop-down menu of Get Token. Subsequently, a tabbed window will appear; we can check access permission to various levels of the application. For example, we can check tagged_places to access the locations that we previously tagged. After we selected the permissions that we require, we can click on Get Access Token to allow Graph API Explorer to access our insight data. After completing these steps, you will see an access token, which is a temporary and short-lived token that you can use to access Facebook API.

With the access token, we can then access Facebook API with R. First, we need a HTTP request package. Similarly to the web scraping recipe, we can use the rvest package to make the request. We craft a request URL with the addition of the access_token (copied from Graph API Explorer) to the Facebook API. From the response, we should receive JSON formatted data. To read the attributes of the JSON format data, we install and load the RJSON package. We can then use the fromJSON function to read the JSON format string extracted from the response.

Finally, we read places and ID information through the use of the sapply function, and we can then use data.frame to transform extracted information to the data frame. At the end of this recipe, we should see data formatted in the data frame.

There's more...

To learn more about Graph API, you can read the official document from Facebook (https://developers.facebook.com/docs/reference/api/field_expansion/):

  1. First, we need to install and load the Rfacebook package:
    > install.packages("Rfacebook")
    > library(Rfacebook)
    
  2. We can then use built-in functions to retrieve data from the user or access similar information with the provision of an access token:
    > getUsers("me", "<access_token>")

If you want to scrape public fan pages without logging into Facebook every time, you can create a Facebook app to access insight information on behalf of the app.:

  1. To create an authorized app token, login to the Facebook developer page and click on Add a New Page:

    R for Data Science Cookbook

  2. You can create a new Facebook app with any name, providing that it has not already been registered:

    R for Data Science Cookbook

  3. Finally, you can copy both the app ID and app secret and craft the access token to <APP ID>|<APP SECRET>. You can now use this token to scrape public fan page information with Graph API:

    R for Data Science Cookbook

  4. Similarly to Rfacebook, we can then replace the access_token with <APP ID>|<APP SECRET>:
    > getUsers("me", "<access_token>")

Summary

In this article, we learned how to utilize R packages to read data from a text format and scan files line by line. We also learned how to scrape internet and social network data using the R web scraper. 

Resources for Article: 


Further resources on this subject:


You've been reading an excerpt of:

R for Data Science Cookbook

Explore Title