You're reading from Learning R Programming

Product type: Book
Published in: Oct 2016
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781785889776
Edition: 1st Edition
Author: Kun Ren

Kun Ren has used R for nearly 4 years in quantitative trading, along with C++ and C#, and has worked intensively (often 8-10 hours a day) on useful R packages that the community does not yet offer. He contributes to packages developed by other authors and reports issues to make things work better. He is a frequent speaker at R conferences in China and has given multiple talks, and he maintains an active social media presence. His substantial contributions to various projects are evident from his GitHub account:
https://github.com/renkun-ken
https://cn.linkedin.com/in/kun-ren-76027530
http://renkun.me/
http://renkun.me/formattable/
http://renkun.me/pipeR/
http://renkun.me/rlist/

Chapter 14. Web Scraping

R provides a platform with easy access to statistical computing and data analysis. Given a dataset, it is easy to perform data transformations and apply analytic models and numerical methods, using either flexible data structures or high-performance code, as discussed in previous chapters.

However, the input data set is not always as immediately available as tables provided by well-organized commercial databases. Sometimes, we have to collect data by ourselves. Web content is an important source of data for a wide range of research fields. To collect (scrape or harvest) data from the Internet, we need appropriate techniques and tools. In this chapter, we'll introduce the basic knowledge and tools of web scraping, including:

  • Looking inside web pages

  • Learning CSS and XPath selectors

  • Analyzing HTML code and extracting data

Looking inside web pages


Web pages are made to present information. The following screenshot shows a simple web page located at data/simple-page.html that has a heading and a paragraph:

All modern web browsers support such web pages. If you open data/simple-page.html with any text editor, it will show the code behind the web page as follows:

<!DOCTYPE html> 
<html> 
<head> 
  <title>Simple page</title> 
</head> 
<body> 
  <h1>Heading 1</h1> 
  <p>This is a paragraph.</p> 
</body> 
</html> 

The preceding code is an example of HTML (HyperText Markup Language), the most widely used language on the Internet. Unlike a programming language, which is ultimately translated into computer instructions, HTML describes the layout and content of a web page, and web browsers are designed to render the code into a page according to web standards.

Modern web browsers...

Extracting data from web pages using CSS selectors


In R, the easiest-to-use package for web scraping is rvest. Run the following code to install the package from CRAN:

install.packages("rvest") 

First, we load the package and use read_html() to read data/single-table.html and try to extract the table from the web page:

library(rvest) 
## Loading required package: xml2 
single_table_page <- read_html("data/single-table.html") 
single_table_page 
## {xml_document} 
## <html> 
## [1] <head>\n  <title>Single table</title>\n</head> 
## [2] <body>\n  <p>The following is a table</p>\n  <table i ... 

Note that single_table_page is a parsed HTML document, which is a nested data structure of HTML nodes.

A typical process for scraping information from such a web page using rvest functions is as follows: first, locate the HTML nodes from which we need to extract data; then, use either a CSS selector or an XPath expression...
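The two steps above can be sketched as follows. This is a minimal example, assuming data/single-table.html contains a single <table> node as shown in the parsed output:

```r
library(rvest)

# Parse the HTML document shipped with the chapter's data files
single_table_page <- read_html("data/single-table.html")

# Step 1: locate the node with a CSS selector ("table" matches <table> tags)
table_node <- html_node(single_table_page, "table")

# Step 2: extract the node's content as a data frame
table_df <- html_table(table_node)
```

The same two-step pattern works for text as well: html_text() extracts the text content of matched nodes instead of converting them into a data frame.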

Learning XPath selectors


In the previous section, we learned about CSS selectors and how to use them as well as functions provided by the rvest package to extract contents from web pages.

CSS selectors are powerful enough to serve most needs of HTML node matching. However, sometimes an even more powerful technique is required to select nodes that meet more specialized conditions.

Take a look at the following web page, which is a bit more complex than data/products.html:

This web page is stored as a standalone HTML file at data/new-products.html. The full source code is long, so we will only show the <body> here. Please go through the source code to get an impression of its structure:

<body> 
  <h1>New Products</h1> 
  <p>The following is a list of products</p> 
  <div id="list" class="product-list"> 
    <ul> 
      <li> 
        <span class="name">Product-A</span> 
        <span class="price">$199.95<...
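Based on the structure shown above, XPath expressions can select the product nodes directly. The following is a hedged sketch, assuming the list items follow the <span class="name">/<span class="price"> pattern throughout the file:

```r
library(rvest)

page <- read_html("data/new-products.html")

# //span[@class = 'name'] matches every <span> whose class attribute is
# "name", anywhere in the document (the CSS equivalent is "span.name")
product_names <- html_text(
  html_nodes(page, xpath = "//span[@class = 'name']"))

# XPath can also express conditions on related nodes that CSS cannot,
# e.g. the name of each product that has a sibling price node
named_with_price <- html_nodes(page,
  xpath = "//li[span[@class = 'price']]/span[@class = 'name']")
```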

Analyzing HTML code and extracting data


In the previous sections, we learned the basics of HTML, CSS, and XPath. To scrape real-world web pages, the problem now becomes one of writing the proper CSS or XPath selectors. In this section, we introduce some simple ways to figure out working selectors.

Suppose we want to scrape all available R packages at https://cran.rstudio.com/web/packages/available_packages_by_name.html. The web page looks simple. To figure out the selector expression, right-click on the table and select Inspect Element in the context menu, which should be available in most modern web browsers:

Then the inspector panel shows up and we can see the underlying HTML of the web page. In Firefox and Chrome, the selected node is highlighted so it can be located more easily:

The HTML contains a unique <table>, so we can directly use table to select it, then use html_table() to extract it as a data frame:

page <- read_html("https://cran.rstudio.com/web/packages/available_packages_by_name...
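The full extraction can be sketched as follows; the exact column names of the resulting data frame depend on the live page, so treat them as assumptions:

```r
library(rvest)

page <- read_html(
  "https://cran.rstudio.com/web/packages/available_packages_by_name.html")

# The page contains a single <table>, so the bare "table" selector is enough;
# fill = TRUE tolerates rows with missing cells
pkg_table <- html_table(html_node(page, "table"), fill = TRUE)

# Inspect the first few rows; each row should describe one CRAN package
head(pkg_table)
```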

Summary


In this chapter, we learned how web pages are written in HTML and styled with CSS. CSS selectors can be used to match HTML nodes so that their contents can be extracted. Well-written HTML documents can also be queried with XPath expressions, which offer more features and greater flexibility. We then learned how to use the element inspector in modern web browsers to figure out a restrictive selector to match the HTML nodes of interest so that the needed data can be extracted from web pages.

In the next chapter, we will learn a series of techniques that boost your productivity, from R Markdown documents and diagrams to interactive Shiny apps. These tools make it much easier to create high-quality, reproducible, and interactive documents, which are excellent ways to present data, ideas, and prototypes.

