Introduction to Web Scraping

Web scraping is the process of extracting a structured representation of data from a website. Because existing web scraping techniques are based on a page's markup, they are sensitive to variation in the HTML: a change in the HTML of a page can lead to the extraction of incorrect data.

Throughout this book, we will be using R to help us scrape data from web pages. R is an open source programming language and one of the most popular languages among data scientists and researchers. R not only provides algorithms for statistical models and machine learning methods, but also provides a web scraping environment for researchers. The data collected from websites also needs to be stored somewhere, so we will learn how to store it in PostgreSQL databases from R.

As an example, a company may want to automatically track its competitors' product prices. If a competitor's site does not provide an API, the solution is to write a program that targets the markup of the web page. A common approach is to parse the web page into a tree representation and query it with XPath expressions. If you are wondering how to make such scripts run automatically, you will find the answer in this book.
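The tree-and-XPath approach described above can be sketched in a few lines. This is a minimal illustration rather than the book's own code: it uses Python's standard library instead of R, the product listing and the extract_prices function are hypothetical, and it assumes the markup is well-formed XML. Note that xml.etree.ElementTree supports only a subset of XPath, and that real web pages usually need a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product listing standing in for a competitor's page.
PAGE = """
<html>
  <body>
    <div class="product">
      <span class="name">Kettle</span>
      <span class="price">24.99</span>
    </div>
    <div class="product">
      <span class="name">Toaster</span>
      <span class="price">39.50</span>
    </div>
  </body>
</html>
"""


def extract_prices(markup):
    """Parse the markup into a tree and query it with XPath expressions."""
    root = ET.fromstring(markup.strip())
    prices = {}
    # Select every <div class="product"> anywhere below the root.
    for product in root.findall(".//div[@class='product']"):
        name = product.find("./span[@class='name']").text
        price = float(product.find("./span[@class='price']").text)
        prices[name] = price
    return prices


prices = extract_prices(PAGE)
print(prices)
```

If the site renames the class attributes or restructures the page, the XPath expressions stop matching, which is exactly the sensitivity to HTML changes noted at the start of this chapter.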

The aim of this book is to offer a quick guide to web scraping techniques and software that can be used to extract data from websites.

In this chapter, we will learn about the following topics:

Data on the internet
Introduction to XPath (XML Path)
Data extraction systems
Web scraping techniques

Learning about data on the internet

Data is an essential part of any research, whether it be academic, marketing, or scientific. The World Wide Web (WWW) contains all kinds of information from different sources. Some of these are social, financial, security, and academic resources, all accessible via the internet.

People may want to collect and analyse data from multiple websites. Even websites that belong to the same category display information in different formats. Even with a single website, you may not be able to see all the data at once. The data may be spread across multiple pages under various sections.

Most websites do not allow you to save a copy of the data to your local storage. The only option is to manually copy and paste the data shown by the website into a local file on your computer. This is a very tedious process that can take a lot of time.

Web scraping is a technique by which people can extract data from multiple websites into a single spreadsheet or database so that it becomes easier to analyse or even visualize the data. Web scraping is used to transform unstructured data from the web into a centralized local database.

Well-known companies and organizations, including Google, Amazon, Wikipedia, Facebook, and many more, provide APIs (Application Programming Interfaces) that contain object classes that facilitate interaction with variables, data structures, and other software components. In this way, data collection from those websites is fast and can be performed without any web scraping software.

One of the most useful features for web scraping is that web pages are semi-structured and can naturally be modeled as labeled, rooted trees. In these trees, the labels are the tags of the HTML markup language, and the tree hierarchy represents the different nesting levels of the elements that make up the web page. This representation of a web page as a labeled, ordered, rooted tree is referred to as the DOM (Document Object Model), a standard that has largely been edited by the World Wide Web Consortium (W3C).

The general idea behind the DOM is that an HTML web page is plain text annotated with HTML tags, which are keywords defined by the markup language. The browser interprets these tags to render the elements of the page. HTML tags can be nested in a hierarchical structure. In the DOM, this hierarchy is captured by a document tree whose nodes represent the HTML tags. We will take a look at DOM structures while we focus on XPath rules.
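To make the tree idea concrete, the following sketch builds a simplified tag tree from a small HTML string and prints it with one level of indentation per nesting level. It is only an approximation of a DOM under stated assumptions: it uses Python's standard html.parser, the TreeBuilder class and outline function are hypothetical names, text nodes and attributes are ignored, and the input is assumed to contain no void elements such as br or img.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Builds a nested (tag, children) tree from HTML start and end tags."""

    def __init__(self):
        super().__init__()
        self.root = ("document", [])
        self.stack = [self.root]  # path from the root to the open element

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)  # attach to the currently open element
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()


def outline(node, depth=0):
    """Return the tree as indented lines, one line per element."""
    tag, children = node
    lines = ["  " * depth + tag]
    for child in children:
        lines.extend(outline(child, depth + 1))
    return lines


HTML = ("<html><head><title>Shop</title></head>"
        "<body><h1>Offers</h1><p>Today <b>only</b></p></body></html>")

builder = TreeBuilder()
builder.feed(HTML)
tree_lines = outline(builder.root)
print("\n".join(tree_lines))
```

The b element ends up three levels below html, mirroring how deeply it is nested in the markup.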

Web scraping techniques

Web scraping techniques open a new world for researchers by automatically extracting structured datasets from human-readable web content. A web scraper accesses web pages, finds the data items specified on the page, extracts them, transforms them into different formats if necessary, and finally saves this data as a structured dataset.
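The find, extract, transform, and save steps described above can be sketched as follows. This is an illustrative outline in Python's standard library, not a production scraper: the page content is a hypothetical string embedded in the script (a real scraper would download it over HTTP), the CellCollector class and scrape function are made-up names, and the dataset is written to an in-memory CSV rather than to a file or database.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would fetch this from a web server.
PAGE = """
<table>
  <tr><td>Kettle</td><td>$24.99</td></tr>
  <tr><td>Toaster</td><td>$39.50</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Finds table rows and extracts the text of each cell."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


def scrape(page):
    collector = CellCollector()
    collector.feed(page)  # find and extract the data items
    # Transform: strip the currency symbol and convert prices to numbers.
    records = [(name, float(price.lstrip("$"))) for name, price in collector.rows]
    # Save: write the records as a structured dataset in CSV form.
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["product", "price_usd"])
    writer.writerows(records)
    return buffer.getvalue()


dataset = scrape(PAGE)
print(dataset)
```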

This can be described as imitating how a web browser works: accessing web pages and saving them to the computer's hard disk. Researchers then use this content for analysis after cleaning and organizing the data.

A web scraper replaces the process of manually gathering data from many web pages and putting together structured datasets from complex, unstructured text that spans thousands, or even millions, of individual pages. Web scraping discussions often bring with them questions about legality and fair use.

In theory, web scraping is the practice of collecting data in any way other than a program interacting with an API. This is usually accomplished by writing an automated program that queries a web server, requests data, and then parses that data to extract the necessary information.

There are many different web scraping techniques. This section describes and discusses the most widely used ones.

Traditional copy and paste

Occasionally, manually examining a page and copying and pasting its data is the best and most workable web scraping technique. However, it is error-prone, boring, and tiresome when people need to scrape lots of datasets (Web scraping, 2015).

Text grabbing and regular expression

This is a simple and powerful approach that's used to obtain information from web pages. This technique is based on UNIX commands such as grep or on the regular expression matching features of a programming language.
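As a small illustration of regular expression matching, the sketch below pulls product prices and an email address out of a fragment of page text using Python's re module. The text fragment and both patterns are hypothetical and deliberately simple; in particular, the email pattern is a rough approximation and not a full address validator.

```python
import re

# Hypothetical fragment of page source.
TEXT = """
<li>Kettle - <b>$24.99</b></li>
<li>Toaster - <b>$39.50</b></li>
<li>Contact: sales@example.com</li>
"""

# Two capture groups: the product name and the numeric part of the price.
PRICE_PATTERN = re.compile(r"<li>(\w+) - <b>\$(\d+\.\d{2})</b></li>")
# Rough email pattern: local part, "@", then a dotted domain name.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

prices = [(name, float(amount)) for name, amount in PRICE_PATTERN.findall(TEXT)]
emails = EMAIL_PATTERN.findall(TEXT)

print(prices)
print(emails)
```

Patterns like these are quick to write, but they match the text literally, so even a small change in the page's markup can make them miss data.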

Document Object Model (DOM)

By embedding a web browser, such as the Internet Explorer or Mozilla browser control, programs can retrieve dynamic content that's been generated by client-side scripting. These browser controls also parse web pages into a DOM tree, from which programs can retrieve sections of the pages.

Semantic annotation recognition

Pages that need to be scraped may contain metadata, semantic markup, or additional annotations that can be used to find specific data snippets. If the annotations are embedded in the pages, as with Microformats, this technique can be viewed as a special case of DOM parsing. Alternatively, the annotations may be organized into a semantic layer that is stored and managed separately from the web pages. In that case, the scraper can retrieve the data schema and instructions from this layer before scraping the pages.
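For annotations embedded in the page, a scraper can look for the annotation markers instead of relying on the page layout. The sketch below assumes a fragment marked up in the microformats2 style, where an h-card container holds properties whose class names start with p-. The page fragment, the person and organization in it, and the MicroformatReader class are all hypothetical, and the reader is deliberately simplified: it handles only flat, non-nested properties.

```python
from html.parser import HTMLParser

# Hypothetical page fragment annotated in the microformats2 style.
PAGE = """
<div class="h-card">
  <span class="p-name">Ada Lovelace</span>
  <span class="p-org">Analytical Engines Ltd</span>
  <span class="note">not annotated as a property</span>
</div>
"""


class MicroformatReader(HTMLParser):
    """Collects the text of elements whose class starts with 'p-'."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self.current = None  # name of the property being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        names = [c[2:] for c in classes if c.startswith("p-")]
        self.current = names[0] if names else None

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        if self.current and data.strip():
            self.properties[self.current] = data.strip()


reader = MicroformatReader()
reader.feed(PAGE)
print(reader.properties)
```

Because the scraper keys on the annotations rather than on the position of elements, it keeps working if the surrounding layout changes, as long as the annotations stay in place.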

Web scraping tools

Many software tools are available for building customized web scraping solutions. These tools may automatically recognize the data structure of a page, provide a recording interface that removes the need to manually write web scraping code, or provide scripting functions and database interfaces that can be used to extract and convert the content. Some of these tools are listed below:

Diffbot: This is a tool that uses computer vision and machine learning algorithms to collect data from web pages automatically, in the way a human being would.
Heritrix: This is a web crawler that was designed for web archiving.
HTTrack: This is a free and open source web crawler and offline browser that downloads websites so that they can be browsed offline.
Selenium (software): This is a framework used for testing web applications.
OutWit Hub: This is a web scraping application with built-in data, image, and document extractors and editors that are used for automatic search and extraction.
Wget: This is a computer program that retrieves content from web servers and supports the HTTP, HTTPS, and FTP protocols.
WSO2 Mashup Server: This tool lets you acquire web-based information from different sources, such as web services.
Yahoo! Query Language (YQL): This is an SQL-like query language that lets you query, filter, and join data across web services.

JavaScript tools

It is also possible to use JavaScript for web scraping tasks. The most commonly used JavaScript tools are as follows:

Node.js: Node.js is an open source, cross-platform JavaScript environment that allows JavaScript code to run without the need for a web browser.
PhantomJS: PhantomJS is a scriptable, headless browser that's used to automate web pages with the JavaScript API that it provides.
jQuery: jQuery is a rich, cross-platform JavaScript library. With jQuery, which is easy to use and learn, it is possible to develop Ajax applications and to select and manipulate objects in the DOM tree.

Web crawling frameworks

The following can be utilized to build web scrapers:

Scrapy: Scrapy is a free and open source web crawling framework written in Python. Originally designed for web scraping, it can also be used to extract data using APIs or as a general-purpose web crawler.
rvest: rvest is an R package written by Hadley Wickham that allows simple data collection from HTML web pages.
RSelenium: RSelenium is designed to make it easy to connect to a Selenium Server/Remote Selenium Server. RSelenium allows connections from the R environment to the Selenium WebDriver API.

Web crawling environment in R

R provides various packages to assist in web scraping operations. These include XML, RCurl, and rjson/RJSONIO/jsonlite. The XML package helps to parse XML and HTML, and provides XPath support for searching XML.

The RCurl package transfers data over various protocols: it can compose general HTTP requests, fetch URLs, submit forms, and so on, using the libcurl library underneath. JSON stands for JavaScript Object Notation and is one of the most common data exchange formats used on the web. The rjson, RJSONIO, and jsonlite packages convert R objects into JSON and JSON back into R objects.
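The conversion between JSON text and native data structures can be illustrated with a short round trip. This sketch uses Python's built-in json module as a stand-in for the R packages named above, and the API response shown is a hypothetical example rather than the output of any real service.

```python
import json

# Hypothetical JSON text, as an API might return it.
RESPONSE = '{"product": "Kettle", "prices": [24.99, 22.5], "in_stock": true}'

# Decode: JSON objects become dictionaries, arrays become lists.
record = json.loads(RESPONSE)
latest_price = record["prices"][-1]

# Modify the native structure, then encode it back to JSON text.
record["latest_price"] = latest_price
encoded = json.dumps(record, sort_keys=True)

print(latest_price)
print(encoded)
```

Decoding the encoded text again yields the same structure, which is what makes JSON convenient for passing data between a web service and an analysis script.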

Web scraping works on the mass of unstructured data, mostly text, that is available on the web. Resources such as blogs, online newspapers, and social networking platforms provide a large amount of text data. This is especially important for researchers who work in areas such as the social sciences and linguistics. Companies like Google, Facebook, Twitter, and Amazon provide APIs that allow analysts to retrieve data.

You can access these APIs from R and collect data. For Google services, the RGoogleStorage and RgoogleMaps packages are available. The twitteR and streamR packages are used to retrieve data from Twitter.

For Amazon services, there is the AWS.tools package, which provides access to Amazon Web Services (EC2/S3), and the MTurkR package, which provides access to the Amazon Mechanical Turk Requester API. To access news content, the GuardianR package can be used. This package provides an interface to the Content API of the Guardian Media Group's Open Platform.

Similarly, the RNYTimes package gives researchers broad access to New York Times web services, including articles, metadata, and user-generated content.

There are also some R packages that provide a web scraping environment in R. In this book, we will look at the two most widely used of them: rvest and RSelenium.

The rvest package, inspired by the Beautiful Soup library, simplifies scraping data from HTML web pages. It is designed to work with the magrittr package. Thus, it is easy and practical to create web scraping scripts consisting of simple, easy-to-understand parts.

Selenium is a web automation tool that was originally developed for testing web applications rather than for scraping. However, with Selenium, you can also develop web scraping scripts. Selenium drives real web browsers. Because all content must be rendered in the browser, this can slow down the data collection process.

Headless browsers such as PhantomJS speed up this process. The RSelenium package allows you to connect to a Selenium Server. RSelenium allows for unit testing and regression testing on a variety of browsers, operating systems, web apps, and web pages.

Description

Web scraping is a technique to extract data from websites. It simulates the behavior of a website user to turn the website itself into a web service to retrieve or introduce new data. This book gives you all you need to get started with scraping web pages using R programming. You will learn about the rules of RegEx and XPath, key components for scraping website data. We will show you web scraping techniques, methodologies, and frameworks. With this book's guidance, you will become comfortable with the tools to write and test RegEx and XPath rules. We will focus on examples of dynamic websites for scraping data and how to implement the techniques learned. You will learn how to collect URLs and then create XPath rules for your first web scraping script using the rvest library. From the data you collect, you will be able to calculate statistics and create R plots to visualize them. Finally, you will discover how to use Selenium drivers with R for more sophisticated scraping. You will create AWS instances and use R to connect to a PostgreSQL database hosted on AWS. By the end of the book, you will be sufficiently confident to create end-to-end web scraping systems using R.
