Web scraping is the process of extracting a structured representation of data from a website. Because existing web scraping techniques are based on the page markup, they are sensitive to variability in the HTML, the formatting language used to lay out data on web pages. A change in the HTML can lead to the extraction of incorrect data.
Throughout this book, we will be using R to help us scrape data from web pages. R is an open source programming language and it's one of the most preferred programming languages among data scientists and researchers. R not only provides algorithms for statistical models and machine learning methods, but also provides a web scraping environment for researchers. The data collected from websites should also be stored somewhere. For this, we will learn to store the data in PostgreSQL databases, which we will do by using R.
As an example, a company may want to automatically track its competitors' product prices. If the competitors' websites do not provide an API, the solution is to write a program that targets the markup of the web page. A common approach is to parse the web page into a tree representation and query it with XPath expressions. If you have questions such as "How can we make scripts run automatically?", you will find the answers in this book.
The aim of this book is to offer a quick guide to web scraping techniques and software that can be used to extract data from websites.
In this chapter, we will learn about the following topics:
- Data on the internet
- Introduction to XPath (XML Path)
- Data extraction systems
- Web scraping techniques
Data is an essential part of any research, whether it be academic, marketing, or scientific. The World Wide Web (WWW) contains all kinds of information from different sources. Some of these are social, financial, security, and academic resources and are accessible via the internet.
People may want to collect and analyse data from multiple websites. Websites that belong to the same category often display the same kind of information in different formats. Even with a single website, you may not be able to see all the data at once, because the data may be spread across multiple pages under various sections.
Most websites do not allow you to save a copy of the data to your local storage. The only option is then to manually copy and paste the data shown by the website into a local file on your computer. This is a very tedious process that can take a lot of time.
Web scraping is a technique by which people can extract data from multiple websites into a single spreadsheet or database so that it becomes easier to analyse or even visualize the data. Web scraping is used to transform unstructured data from the web into a centralized local database.
Well-known organizations, including Google, Amazon, Wikipedia, Facebook, and many more, provide APIs (Application Programming Interfaces), which expose classes and functions that facilitate interaction with their data structures and other software components. In this way, data collection from those websites is fast and can be performed without any web scraping software.
One of the features most often exploited when scraping web pages is their semi-structured nature: a page can be represented naturally as a labeled, ordered, rooted tree. In this tree, the labels are the tags of the HTML markup language, and the tree hierarchy represents the different nesting levels of the elements that make up the web page. The representation of a web page as a labeled, ordered, rooted tree is referred to as the DOM (Document Object Model), which is largely edited by the World Wide Web Consortium (W3C).
The general idea behind the DOM is that an HTML web page is plain text annotated with HTML tags, which are keywords defined in the markup language. The browser interprets these tags to render the elements of the page. HTML tags can be nested in a hierarchical structure, and the document tree captures this hierarchy: each node in the DOM tree represents an HTML tag. We will take a look at DOM structures while we focus on XPath rules.
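To make the tree view concrete, here is a minimal sketch that parses a small page and prints its tags indented by nesting level. It is written in Python using only the standard library, purely so that it is self-contained. The sample markup is hypothetical, and it is assumed to be well formed so that an XML parser can read it, which real HTML often is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (an assumption for illustration).
PAGE = """
<html>
  <head><title>Sample shop</title></head>
  <body>
    <h1>Products</h1>
    <ul>
      <li>Laptop</li>
      <li>Phone</li>
    </ul>
  </body>
</html>
""".strip()


def walk(node, depth=0):
    """Yield (depth, tag) pairs for the node and its descendants, in document order."""
    yield depth, node.tag
    for child in node:
        yield from walk(child, depth + 1)


root = ET.fromstring(PAGE)
tree_rows = list(walk(root))

# Print one tag per line, indented by its nesting level in the tree.
for depth, tag in tree_rows:
    print("  " * depth + tag)
```

Each level of indentation in the printed output corresponds to one nesting level in the document tree, with `html` as the root.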
An XPath expression represents a path, and when it is evaluated on a tree, the result is the set of nodes found at the end of every matching path in the tree. HTML, the markup language used to lay out the data on web pages, aims to create a visually appealing interface, so the data has to be located through the structure of the markup.
In particular, the XML Path Language (XPath) provides a powerful syntax for addressing specific elements of an XML document and, by extension, of HTML web pages, in a simple way. Like the DOM, XPath is defined by the World Wide Web Consortium.
There are two ways to use XPath:
- To identify a single item in the document tree
- To address multiple instances of the same item
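The two uses listed above can be sketched as follows. This is a hedged Python illustration using the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath. The sample markup, the `id` value, and the class names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (an assumption for illustration).
PAGE = """
<html>
  <body>
    <h1 id="title">Price list</h1>
    <table>
      <tr><td class="name">Laptop</td><td class="price">999.00</td></tr>
      <tr><td class="name">Phone</td><td class="price">499.50</td></tr>
    </table>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# 1) Identify a single item in the document tree: find() returns the first match.
heading = root.find(".//h1[@id='title']")
print(heading.text)

# 2) Address multiple instances of the same item: findall() returns every match.
prices = [cell.text for cell in root.findall(".//td[@class='price']")]
print(prices)
```

The same expression syntax serves both purposes. What differs is whether the caller asks for the first matching node or for the whole node set.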
The main weakness of XPath is its lack of flexibility. Each XPath expression is strictly tied to the structure of the web page for which it was defined.
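A small sketch can illustrate this sensitivity. In the hypothetical example below (Python, standard library only), the same price is wrapped in one extra element in a newer version of the page. An expression that spells out every step of the path stops matching, while one that searches all descendants still finds the node. Both page versions and both expressions are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same page; the newer one adds a <section> wrapper.
OLD_PAGE = "<html><body><div><span class='price'>19.90</span></div></body></html>"
NEW_PAGE = (
    "<html><body><div><section><span class='price'>19.90</span>"
    "</section></div></body></html>"
)

STRICT_PATH = "./body/div/span"         # spells out every step from the root
LOOSE_PATH = ".//span[@class='price']"  # searches descendants at any depth


def first_text(markup, path):
    """Return the text of the first node matching path, or None if nothing matches."""
    node = ET.fromstring(markup).find(path)
    return None if node is None else node.text


print(first_text(OLD_PAGE, STRICT_PATH))  # matches the old layout
print(first_text(NEW_PAGE, STRICT_PATH))  # no match after the layout change
print(first_text(NEW_PAGE, LOOSE_PATH))   # still matches
```

The looser expression is not immune to change either: renaming the `price` class would break it in the same way.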
However, this limitation has been partially mitigated by the relative path expressions added in recent releases. In general, even small changes to the structure of a web page can cause an XPath expression that was defined for an earlier version of the page to stop working correctly. In the following screenshot, you can see one XPath rule and its response:
A web data extraction system can be defined as a platform that implements a set of procedures that take information from web sources. In most cases, the average end users of web data extraction systems are companies or data analysts looking for web-related information.
An intermediate user category consists of non-specialized individuals who need to collect some web content, often irregularly. Users in this category are often inexperienced and are looking for simple yet powerful web data extraction software packages. DEiXTo is one of them. DEiXTo is based on the W3C Document Object Model and allows users to easily create extraction rules that point to the portion of data to be extracted from a website.
In addition, web scrapers can go places that traditional search engines cannot reach. A Google search for cheap flights to Turkey returns a large number of results, including advertisements and popular flight search sites. Google only knows what these websites say on their content pages, not the exact results of the various queries entered into a flight search application. However, a well-developed web scraper can track how the price of a flight to Turkey varies over time across various websites and can tell you the best time to purchase your ticket.
Web scraping techniques open a new world for researchers by automatically extracting structured datasets from human-readable web content. A web scraper accesses web pages, finds the data items specified on the page, extracts them, transforms them into different formats if necessary, and finally saves this data as a structured dataset.
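The steps just described (access, find, extract, transform, save) can be sketched in a few lines. The example below is a hedged Python illustration using only the standard library. To stay self-contained, it uses an inline string in place of a downloaded page and an in-memory buffer in place of a file. The markup, class names, and column names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Stand-in for a page that a real scraper would download (hypothetical markup).
PAGE = """
<html><body><table>
  <tr class="header"><th>Name</th><th>Price</th></tr>
  <tr class="product"><td class="name"> Laptop </td><td class="price">$1,299.00</td></tr>
  <tr class="product"><td class="name">Phone</td><td class="price">$499.50</td></tr>
</table></body></html>
""".strip()


def scrape(markup):
    """Find the product rows, extract their cells, and transform them into records."""
    root = ET.fromstring(markup)
    rows = []
    for tr in root.findall(".//tr[@class='product']"):
        name = tr.find("td[@class='name']").text.strip()
        price_text = tr.find("td[@class='price']").text
        # Transform: "$1,299.00" -> 1299.0
        price = float(price_text.replace("$", "").replace(",", ""))
        rows.append({"name": name, "price": price})
    return rows


def to_csv(rows):
    """Save the records as CSV text (an in-memory buffer stands in for a file)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


records = scrape(PAGE)
csv_text = to_csv(records)
print(csv_text)
```

Keeping extraction (`scrape`) separate from storage (`to_csv`) means the storage step could later be swapped for a database writer without touching the parsing logic.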
This can be described as imitating the way a web browser works: accessing web pages and saving them to the computer's hard disk cache. Researchers use this content for analysis after cleaning and organizing the data.
A web scraper automates the process of manually gathering data from many web pages and puts together structured datasets from complex, unstructured text that spans thousands, or even millions, of individual pages. Web scraping discussions often bring with them questions about legality and fair use.
In theory, web scraping is the practice of collecting data in any way other than a program interacting with an API. This is usually accomplished by writing an automated program that queries a web server, requests data, and then parses that data to extract the necessary information.
There are many different web scraping techniques. In this section, the most widely used ones will be described and discussed.
Occasionally, manual examination combined with copy and paste is the best and most workable web scraping technique. However, this is an error-prone, boring, and tiresome technique when people need to scrape lots of datasets (Web scraping, 2015).
Text pattern matching is a simple and powerful approach that's used to obtain information from web pages. This technique is based on UNIX commands such as grep or on the regular expression matching features of a programming language.
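As a minimal sketch of this technique, the Python snippet below applies two regular expressions directly to the raw markup, without building a tree. The sample markup and both patterns are assumptions chosen for illustration. The patterns are deliberately simple and would need tightening for real pages.

```python
import re

# Hypothetical page fragment (an assumption for illustration).
PAGE = """
<ul>
  <li>Laptop - <b>$1,299.00</b></li>
  <li>Phone - <b>$499.50</b></li>
  <li>Contact: sales@example.com</li>
</ul>
"""

# A dollar amount with optional thousands separators and two decimals.
PRICE = re.compile(r"\$(\d{1,3}(?:,\d{3})*\.\d{2})")
# A deliberately simple e-mail pattern; not a full address validator.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

prices = PRICE.findall(PAGE)
emails = EMAIL.findall(PAGE)

print(prices)
print(emails)
```

Pattern matching ignores the document structure entirely, which keeps it simple but also means it cannot tell a product price from any other dollar amount on the page.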
By embedding a full-fledged web browser, such as the Internet Explorer or Mozilla browser control, programs can retrieve dynamic content that's been generated by client-side scripting. These browser controls also parse web pages into a DOM tree, from which programs can retrieve sections of the pages.
Pages that need to be scraped may contain metadata, semantic markup, or additional annotations that can be used to find specific data snippets. If the annotations are embedded in the pages, as with Microformats, this technique can be viewed as a special case of DOM parsing. In another case, the annotations are organized into a semantic layer that is stored and managed separately from the web pages, so the scraper can retrieve the data schema and instructions from this layer before scraping the pages.
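For the embedded case, the sketch below shows how annotations can guide extraction. It is a hedged Python example using only the standard library. It assumes a fragment annotated with microformats2-style class names (`h-card`, `p-name`, `p-locality`) and collects the text of every element whose class starts with `p-`. The fragment is hypothetical, and nested properties are not handled.

```python
from html.parser import HTMLParser

# Hypothetical fragment annotated with microformats2-style class names.
PAGE = """
<div class="h-card">
  <span class="p-name">Ada Lovelace</span>
  <a class="u-url" href="https://example.com/ada">Home page</a>
  <span class="p-locality">London</span>
</div>
"""


class MicroformatParser(HTMLParser):
    """Collect the text of elements whose class name starts with 'p-'.

    Simplification: properties nested inside other properties are not handled.
    """

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._current = None  # name of the property currently being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name.startswith("p-"):
                self._current = name
                self.properties[name] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.properties[self._current] += data

    def handle_endtag(self, tag):
        self._current = None


parser = MicroformatParser()
parser.feed(PAGE)
card = {key: value.strip() for key, value in parser.properties.items()}
print(card)
```

Because the scraper keys on the annotation names rather than on tag positions, it keeps working if the surrounding layout changes, as long as the annotations themselves are preserved.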
It is possible to build custom web scraping solutions, but there are also many software tools that can be used for this. These tools may attempt to automatically recognize the data structure of a page, provide a recording interface that removes the need to manually write web scraping code, or offer scripting functions and database interfaces that can be used to extract, convert, and store the content. Some of those tools are listed below:
- Diffbot: This is a tool that uses computer vision and machine learning algorithms to collect data from web pages automatically, in the way a human being would.
- Heritrix: This is a web crawler that was designed for web archiving.
- HTTrack: This is a free and open source web crawler and offline browser that downloads websites so that they can be browsed offline.
- Selenium (software): This is a testing framework for web applications.
- OutWit Hub: This is a web scraping application with built-in data, image, and document extractors and editors that are used for automatic search and extraction.
- Wget: This is a computer program that retrieves content from web servers and supports the HTTP, HTTPS, and FTP protocols.
- WSO2 Mashup Server: This tool lets you acquire web-based information from different sources, such as web services.
- Yahoo! Query Language (YQL): This is an SQL-like query language that lets you query, filter, and join data across web services.
The following can be utilized to build web scrapers:
- Scrapy: Scrapy is a free and open source web crawling framework written in Python that was originally designed for scraping the web. It can also be used to extract data through APIs or as a general-purpose web crawler.
- rvest: rvest is an R package written by Hadley Wickham that allows simple data collection from HTML web pages.
- RSelenium: RSelenium is designed to make it easy to connect to a Selenium Server/Remote Selenium Server. RSelenium allows connections from the R environment to the Selenium WebDriver API.
R provides various packages to assist in web scraping operations. These include XML, RCurl, and rjson/RJSONIO/jsonlite. The XML package helps to parse XML and HTML, and provides XPath support for searching XML.
Web scraping is based on gathering unstructured data, mostly text, from the web. Sources such as blogs, online newspapers, and social networking platforms provide a large amount of text data. This is especially important for researchers who work in areas such as the social sciences and linguistics. Companies like Google, Facebook, Twitter, and Amazon provide APIs that allow analysts to retrieve data.
You can access these APIs from R and collect data. For Google services, the RGoogleStorage and RgoogleMaps packages are available. The twitteR and streamR packages are used to retrieve data from Twitter.
For Amazon services, there is the AWS.tools package, which provides access to Amazon Web Services (EC2/S3), and the MTurkR package, which provides access to the Amazon Mechanical Turk Requester API. To access news bulletins, the GuardianR package can be used. This package provides an interface to the Content API of the Guardian Media Group's Open Platform.
Similarly, the RNYTimes package provides broad access to New York Times web services, including articles, metadata, and user-generated content.
There are also some R packages that provide a web scraping environment in R. In this book, we will look at the two best-known and most widely used of them: rvest and RSelenium.
The rvest package, which is inspired by Python's Beautiful Soup library, simplifies data scraping from HTML web pages. It is designed to work with the magrittr package. Thus, it is easy and practical to create web scraping scripts consisting of simple, easy-to-understand parts.
Selenium is a web automation tool that was not originally developed for scraping. However, with Selenium, you can develop web scraping scripts. Selenium can also drive web browsers. Because Selenium runs a real web browser, all content must be rendered in the browser, which can slow down the data collection process.
Headless browsers such as PhantomJS speed up this process. The RSelenium package allows you to connect to a Selenium Server. RSelenium allows for unit testing and regression testing of web apps and web pages on a variety of browsers and operating systems.
In this chapter, we talked about the important rules of RegEx and XPath. We talked about the general idea of web scraping and web crawling. We then described web scraping techniques, methodologies, and frameworks, before finally introducing the various web scraping environments that are available in R.
In the next chapter, we will learn about XPath rules. We will focus on XPath methodology and how to write XPath rules. We will then learn about the main idea behind these rules and put them into practice. After writing our first RegEx and XPath rules, we will jump into writing our first web scraper by using R. The RSelenium and rvest libraries are going to be used throughout this book.