You're reading from Learning R Programming

Product type: Book
Published in: Oct 2016
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781785889776
Edition: 1st Edition
Author: Kun Ren

Kun Ren has used R for nearly 4 years in quantitative trading, along with C++ and C#, and has worked intensively (often 8-10 hours a day) on useful R packages that the community does not yet offer. He contributes to packages developed by other authors and reports issues to make things work better. He is a frequent speaker at R conferences in China and has given multiple talks, and he maintains an active social media presence. His substantial contributions to various projects are evident from his GitHub account:
https://github.com/renkun-ken
https://cn.linkedin.com/in/kun-ren-76027530
http://renkun.me/
http://renkun.me/formattable/
http://renkun.me/pipeR/
http://renkun.me/rlist/

Chapter 14. Web Scraping

R provides a platform with easy access to statistical computing and data analysis. Given a dataset, it is easy to perform data transformations and apply analytic models and numerical methods, using either flexible data structures or high-performance code, as discussed in previous chapters.

However, the input data set is not always as immediately available as tables provided by well-organized commercial databases. Sometimes, we have to collect data by ourselves. Web content is an important source of data for a wide range of research fields. To collect (scrape or harvest) data from the Internet, we need appropriate techniques and tools. In this chapter, we'll introduce the basic knowledge and tools of web scraping, including:

  • Looking inside web pages

  • Learning CSS and XPath selectors

  • Analyzing HTML code and extracting data

Looking inside web pages


Web pages are made to present information. The following screenshot shows a simple web page located at data/simple-page.html that has a heading and a paragraph:

All modern web browsers support such web pages. If you open data/simple-page.html with any text editor, it will show the code behind the web page as follows:

<!DOCTYPE html> 
<html> 
<head> 
  <title>Simple page</title> 
</head> 
<body> 
  <h1>Heading 1</h1> 
  <p>This is a paragraph.</p> 
</body> 
</html> 

The preceding code is an example of HTML (HyperText Markup Language), the most widely used language on the Internet. Unlike a programming language, which is ultimately translated into computer instructions, HTML describes the layout and content of a web page, and web browsers are designed to render the code into a page according to web standards.

Modern web browsers...

Extracting data from web pages using CSS selectors


In R, the easiest-to-use package for web scraping is rvest. Run the following code to install the package from CRAN:

install.packages("rvest") 

First, we load the package and use read_html() to read data/single-table.html and try to extract the table from the web page:

library(rvest) 
## Loading required package: xml2 
single_table_page <- read_html("data/single-table.html") 
single_table_page 
## {xml_document} 
## <html> 
## [1] <head>\n  <title>Single table</title>\n</head> 
## [2] <body>\n  <p>The following is a table</p>\n  <table i ... 

Note that single_table_page is a parsed HTML document, which is a nested data structure of HTML nodes.

A typical process for scraping information from such a web page using rvest functions is as follows: first, locate the HTML nodes from which we need to extract data; then, use either a CSS selector or an XPath expression...
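The two steps above can be sketched as follows. This is a minimal example, assuming data/single-table.html contains a single <table> node as shown in the parsed output:

```r
library(rvest)

# Parse the HTML document shipped with the chapter's data files
single_table_page <- read_html("data/single-table.html")

# Step 1: locate the node with a CSS selector ("table" matches <table> tags)
table_node <- html_node(single_table_page, "table")

# Step 2: extract the node's content as a data frame
table_df <- html_table(table_node)
```

The same two-step pattern works for text as well: html_text() extracts the text content of matched nodes instead of converting them into a data frame.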

Learning XPath selectors


In the previous section, we learned about CSS selectors and how to use them as well as functions provided by the rvest package to extract contents from web pages.

CSS selectors are powerful enough to serve most needs of HTML node matching. However, sometimes an even more powerful technique is required to select nodes that meet more specialized conditions.

Take a look at the following web page, which is a bit more complex than data/products.html:

This web page is stored as a standalone HTML file at data/new-products.html. The full source code is long, so we will only show the <body> here. Please go through the source code to get an impression of its structure:

<body> 
  <h1>New Products</h1> 
  <p>The following is a list of products</p> 
  <div id="list" class="product-list"> 
    <ul> 
      <li> 
        <span class="name">Product-A</span> 
        <span class="price">$199.95<...
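Based on the structure shown above, XPath expressions can select the product nodes directly. The following is a hedged sketch, assuming the list items follow the <span class="name">/<span class="price"> pattern throughout the file:

```r
library(rvest)

page <- read_html("data/new-products.html")

# //span[@class = 'name'] matches every <span> whose class attribute is
# "name", anywhere in the document (the CSS equivalent is "span.name")
product_names <- html_text(
  html_nodes(page, xpath = "//span[@class = 'name']"))

# XPath can also express conditions on related nodes that CSS cannot,
# e.g. the name of each product that has a sibling price node
named_with_price <- html_nodes(page,
  xpath = "//li[span[@class = 'price']]/span[@class = 'name']")
```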

Analyzing HTML code and extracting data


In the previous sections, we learned the basics of HTML, CSS, and XPath. To scrape real-world web pages, the problem now becomes one of writing the proper CSS or XPath selectors. In this section, we introduce some simple ways to figure out working selectors.

Suppose we want to scrape all available R packages at https://cran.rstudio.com/web/packages/available_packages_by_name.html. The web page looks simple. To figure out the selector expression, right-click on the table and select Inspect Element in the context menu, which should be available in most modern web browsers:

Then the inspector panel shows up and we can see the underlying HTML of the web page. In Firefox and Chrome, the selected node is highlighted so it can be located more easily:

The HTML contains a unique <table>, so we can directly use table to select it, then use html_table() to extract it as a data frame:

page <- read_html("https://cran.rstudio.com/web/packages/available_packages_by_name...
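The full extraction can be sketched as follows; the exact column names of the resulting data frame depend on the live page, so treat them as assumptions:

```r
library(rvest)

page <- read_html(
  "https://cran.rstudio.com/web/packages/available_packages_by_name.html")

# The page contains a single <table>, so the bare "table" selector is enough;
# fill = TRUE tolerates rows with missing cells
pkg_table <- html_table(html_node(page, "table"), fill = TRUE)

# Inspect the first few rows; each row should describe one CRAN package
head(pkg_table)
```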

Summary


In this chapter, we learned how web pages are written in HTML and styled with CSS. CSS selectors can be used to match HTML nodes so that their contents can be extracted. Well-written HTML documents can also be queried with XPath expressions, which offer more features and greater flexibility. We then learned how to use the element inspector in modern web browsers to figure out a restrictive selector to match the HTML nodes of interest so that the needed data can be extracted from web pages.

In the next chapter, we will learn a series of techniques that boost your productivity, from R Markdown documents and diagrams to interactive Shiny apps. These tools make it much easier to create high-quality, reproducible, and interactive documents, which are excellent ways to present data, ideas, and prototypes.

