Extracting data from web pages using CSS selectors
In R, the easiest-to-use package for web scraping is rvest. Run the following code to install the package from CRAN:
install.packages("rvest")
First, we load the package and use read_html() to read data/single-table.html, the web page from which we will try to extract the table:
library(rvest)
## Loading required package: xml2
single_table_page <- read_html("data/single-table.html")
single_table_page
## {xml_document}
## <html>
## [1] <head>\n <title>Single table</title>\n</head>
## [2] <body>\n <p>The following is a table</p>\n <table i ...
Note that single_table_page is a parsed HTML document, which is a nested data structure of HTML nodes.
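To make the nesting concrete, the following minimal sketch walks one level at a time through a parsed document. It uses a small inline page rather than data/single-table.html, whose full contents are not shown here, so the markup and the variable names (demo_html, demo_page, and so on) are assumptions made for illustration only.

```r
library(rvest)

# Hypothetical stand-in for the page; the real file's markup may differ
demo_html <- "<html>
<head><title>Single table</title></head>
<body>
<p>The following is a table</p>
<table><tr><th>Name</th></tr><tr><td>A</td></tr></table>
</body>
</html>"

# read_html() also accepts a string of literal HTML
demo_page <- read_html(demo_html)

# Element nodes directly under the <html> root
top_nodes <- html_children(demo_page)
top_names <- html_name(top_nodes)

# Element nodes directly under <body>, one level deeper
body_nodes <- html_children(top_nodes[[2]])
body_names <- html_name(body_nodes)

print(top_names)
print(body_names)
```

Each call to html_children() descends one level, which is why the document can be treated as a tree of nodes. In practice, walking the tree by hand like this is rarely necessary, because selectors can address nodes at any depth directly.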
A typical process for scraping information from such a web page using rvest functions is as follows. First, locate the HTML nodes from which we need to extract data. Then, use either a CSS selector or an XPath expression to filter the HTML...