You're reading from R Web Scraping Quick Start Guide

Product type: Book

Published in: Oct 2018

Reading level: Beginner

Publisher: Packt

ISBN-13: 9781789138733

Edition: 1st Edition

Concepts: Data Mining

Author (1)

Olgun Aydin

Web scraping techniques

Web scraping techniques open a new world for researchers by automatically extracting structured datasets from human-readable web content. A web scraper accesses web pages, finds the specified data items on each page, extracts them, transforms them into a different format if necessary, and finally saves the data as a structured dataset.

In effect, a scraper imitates a web browser: it accesses web pages and saves them to local storage. Researchers then clean and organize this content before using it for analysis.

A web scraper automates the otherwise manual process of gathering data from many web pages and assembling structured datasets from complex, unstructured text that spans thousands, or even millions, of individual pages. Web scraping discussions often raise questions about legality and fair use.

In theory, web scraping is the practice of collecting data by any means other than a program interacting with an API. This is usually accomplished by writing an automated program that queries a web server, requests data, and then parses that data to extract the necessary information.
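The steps described above (access a page, find the data items, extract them, transform them, and save a structured dataset) can be sketched as follows. This is a minimal illustration rather than the book's own code: it is written in Python with only the standard library, and the `PAGE` string, the `CellCollector` class, and the `scrape` and `to_csv` helpers are all hypothetical names. The page is an inline string so that the example runs without network access; a real scraper would download it from a web server.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this from a web server.
PAGE = """
<html><body>
<table id="prices">
  <tr><td>Apple</td><td>$1.20</td></tr>
  <tr><td>Pear</td><td>$0.85</td></tr>
</table>
</body></html>
"""


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def scrape(page):
    # Find and extract the data items.
    collector = CellCollector()
    collector.feed(page)
    # Transform: turn a string such as "$1.20" into a number.
    return [(name, float(price.lstrip("$"))) for name, price in collector.rows]


def to_csv(records):
    # Save the records as a structured dataset (CSV text here, a file in practice).
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["item", "price"])
    writer.writerows(records)
    return buffer.getvalue()


records = scrape(PAGE)
print(to_csv(records))
```

Keeping extraction, transformation, and saving in separate functions makes it easier to swap in a different page layout or output format later.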

There are many different web scraping techniques. This section describes and discusses the most widely used ones.

Traditional copy and paste

Occasionally, because a human examines every page by hand, manual copy and paste is the best workable web scraping technique. However, it is an error-prone, boring, and tiresome technique when people need to scrape large datasets (Web scraping, 2015).

Text grabbing and regular expression

This is a simple yet powerful approach for obtaining information from web pages. It relies on UNIX commands such as grep or on the regular expression matching facilities of a programming language.
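As a rough sketch of text grabbing with regular expressions, the following Python example pulls links and e-mail addresses out of an HTML fragment. The `HTML` string and both patterns are assumptions made for illustration; the e-mail pattern in particular is deliberately simple and would miss some valid addresses.

```python
import re

# Hypothetical HTML fragment used only for illustration.
HTML = """
<ul>
  <li><a href="https://example.com/a">First</a> - contact: alice@example.com</li>
  <li><a href="https://example.com/b">Second</a> - contact: bob@example.org</li>
</ul>
"""

# Capture the target and the visible text of each link.
LINK_PATTERN = re.compile(r'<a\s+href="([^"]+)">([^<]*)</a>')
# A deliberately simple e-mail pattern; real addresses can be more varied.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

links = LINK_PATTERN.findall(HTML)
emails = EMAIL_PATTERN.findall(HTML)

print(links)
print(emails)
```

Patterns like these break easily when the markup changes (for example, when attributes are reordered), which is one reason the DOM-based technique is often preferred for anything beyond simple pages.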

Document Object Model (DOM)

By embedding a full web browser, such as the Internet Explorer or Mozilla browser control, programs can retrieve dynamic content that has been generated by client-side scripts. These browser controls also parse web pages into a DOM tree, from which programs can retrieve sections of the pages.
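The idea of walking a DOM tree to pick out sections of a page can be sketched with Python's standard `xml.dom.minidom` module. This is only an approximation of the technique: unlike an embedded browser control, minidom does not run client-side scripts, and it accepts only well-formed markup. The `PAGE` fragment and the `text_of` helper are hypothetical.

```python
from xml.dom.minidom import parseString

# Hypothetical, well-formed XHTML fragment; minidom does not accept sloppy HTML.
PAGE = """<html>
  <body>
    <div id="news">
      <h2>Headlines</h2>
      <p class="item">First headline</p>
      <p class="item">Second headline</p>
    </div>
    <div id="footer"><p>Contact us</p></div>
  </body>
</html>"""

dom = parseString(PAGE)


def text_of(node):
    """Concatenate the text nodes found anywhere below a DOM node."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        else:
            parts.append(text_of(child))
    return "".join(parts).strip()


# Walk the tree: pick the <div id="news"> section, then the <p> elements inside it.
news = [d for d in dom.getElementsByTagName("div") if d.getAttribute("id") == "news"][0]
items = [text_of(p) for p in news.getElementsByTagName("p")]

print(items)
```

Because the query is scoped to the `news` section, the paragraph in the footer is ignored even though it uses the same tag.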

Semantic annotation recognition

Pages that need to be scraped may contain metadata, semantic markup, or annotations that can be used to locate specific data snippets. If the annotations are embedded in the pages, as Microformats are, this technique can be viewed as a special case of DOM parsing. Alternatively, the annotations may be organized into a semantic layer that is stored and managed separately from the web pages, so the scraper can retrieve the data schema and instructions from this layer before scraping the pages.
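For annotations embedded in the page, a scraper can key on the annotation names instead of the page layout. The sketch below, in Python, reads hCard-style class names (`fn`, `org`, `tel`) from a fragment. The `PAGE` content and the `MicroformatParser` class are hypothetical, and the parser assumes each annotated element contains only plain text.

```python
from html.parser import HTMLParser

# Hypothetical page fragment annotated with hCard-style class names.
PAGE = """
<div class="vcard">
  <span class="fn">Ada Lovelace</span>
  <span class="org">Analytical Engines Ltd</span>
  <span class="tel">+44 20 0000 0000</span>
</div>
"""


class MicroformatParser(HTMLParser):
    """Collects text from elements whose class names one of the wanted properties."""

    def __init__(self, properties):
        super().__init__()
        self.properties = set(properties)
        self.record = {}
        self._current = None  # property currently being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.properties:
                self._current = name
                self.record[name] = ""

    def handle_endtag(self, tag):
        # Assumption: annotated elements hold plain text, so any end tag closes them.
        self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.record[self._current] += data.strip()


parser = MicroformatParser(["fn", "org", "tel"])
parser.feed(PAGE)
contact = parser.record

print(contact)
```

The extraction depends only on the class names, so it keeps working if the surrounding tags change from `span` to, say, `div` or `td`.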

Web scraping tools

Many software tools are available for building custom web scraping solutions. These tools may attempt to recognize the data structure of a page automatically, provide a recording interface that removes the need to write web scraping code by hand, or offer scripting functions and database interfaces for extracting and transforming content. Some of these tools are listed below:

Diffbot: This is a tool that uses computer vision and machine learning algorithms to collect data from web pages automatically, much as a human would.
Heritrix: This is a web crawler that was designed for web archiving.
HTTrack: This is a free and open source website copier and offline browser. It downloads websites to a local directory so that they can be browsed offline.
Selenium (software): This is a framework for testing web applications.
OutWit Hub: This is a web scraping application with built-in data, image, and document extractors and editors for automated search and extraction.
Wget: This is a computer program that retrieves content from web servers and supports downloading over the HTTP, HTTPS, and FTP protocols.
WSO2 Mashup Server: This tool lets you acquire web-based information from different sources, such as web services.
Yahoo! Query Language (YQL): This is an SQL-like query language that lets you query, filter, and join data across web services.

JavaScript tools

It is also possible to use JavaScript for web scraping tasks. The most widely used JavaScript tools are as follows:

Node.js: Node.js is an open source, cross-platform JavaScript runtime environment that allows JavaScript code to run without the need for a web browser.
PhantomJS: PhantomJS is a scriptable headless browser that's used to automate interaction with web pages through the JavaScript API that it provides.
jQuery: jQuery is a feature-rich, cross-browser JavaScript library. With jQuery, which is easy to use and learn, it is possible to develop Ajax applications and to select and manipulate elements in the DOM tree.

Web crawling frameworks

The following can be utilized to build web scrapers:

Scrapy: Scrapy is a free and open source web crawling framework written in Python. Originally designed for web scraping, it can also extract data through APIs or serve as a general-purpose web crawler.
rvest: rvest is an R package written by Hadley Wickham that simplifies collecting data from HTML web pages.
RSelenium: RSelenium is designed to make it easy to connect to a local or remote Selenium Server. It allows connections from the R environment to the Selenium WebDriver API.

Web crawling environment in R

R provides various packages to assist in web scraping. These include XML, RCurl, and rjson/RJSONIO/jsonlite. The XML package parses XML and HTML and provides XPath support for searching XML documents.
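To show what XPath-style searching of a parsed XML document looks like, here is a small sketch. It is written in Python with the standard `xml.etree.ElementTree` module, which supports only a subset of XPath, and it is not the API of the R XML package. The `CATALOG` document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document used only for illustration.
CATALOG = """<catalog>
  <book id="b1" lang="en"><title>Scraping Basics</title><price>20</price></book>
  <book id="b2" lang="tr"><title>Veri Madenciligi</title><price>15</price></book>
  <book id="b3" lang="en"><title>Parsing XML</title><price>30</price></book>
</catalog>"""

root = ET.fromstring(CATALOG)

# Path query: the <title> of every <book> directly under the root.
all_titles = [t.text for t in root.findall("./book/title")]

# Attribute predicate: only books whose lang attribute equals "en".
english_titles = [t.text for t in root.findall("./book[@lang='en']/title")]

# Child-text predicate: books whose <price> text is exactly "15".
cheap_ids = [b.get("id") for b in root.findall("./book[price='15']")]

print(all_titles)
print(english_titles)
print(cheap_ids)
```

Each query states where the data lives in the document tree, so no manual looping over nodes is needed to filter by attribute or child value.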

The RCurl package transfers data over various protocols, composes general HTTP requests, fetches URLs, submits forms, and so on. It performs these operations through the libcurl library. JSON is an abbreviation of JavaScript Object Notation and is one of the most common data formats used on the web. The rjson, RJSONIO, and jsonlite packages convert R objects into JSON and JSON back into R objects.
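The conversion between in-memory data and JSON text can be illustrated with a short round trip. The sketch uses Python's standard `json` module as a stand-in for the R packages named above; the `records` list and the `api_response` string are made-up examples, not output from any real service.

```python
import json

# Hypothetical scraped records held as native data structures.
records = [
    {"item": "Apple", "price": 1.2, "tags": ["fruit", "red"]},
    {"item": "Pear", "price": 0.85, "tags": ["fruit"]},
]

# Encode the records as JSON text, then decode the text back into data.
encoded = json.dumps(records, sort_keys=True)
decoded = json.loads(encoded)

# Decoding a (made-up) JSON response and picking out one field per result.
api_response = '{"status": "ok", "results": [{"id": 1, "title": "First"}, {"id": 2, "title": "Second"}]}'
titles = [result["title"] for result in json.loads(api_response)["results"]]

print(encoded)
print(titles)
```

Since many web APIs return JSON, the decoding step is usually the first thing a collection script does with a response.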

Web scraping draws on the mass of unstructured data, mostly text, that is available on the web. Sources such as blogs, online newspapers, and social networking platforms provide a large amount of text data. This is especially important for researchers who work in areas such as the social sciences and linguistics. Companies like Google, Facebook, Twitter, and Amazon provide APIs that allow analysts to retrieve data.

You can access these APIs from R and collect data. For Google services, the RGoogleStorage and RgoogleMaps packages are available. The twitteR and streamR packages are used to retrieve data from Twitter.

For Amazon services, there is the AWS.tools package, which provides access to Amazon Web Services (EC2/S3), and the MTurkR package, which provides access to the Amazon Mechanical Turk Requester API. To access news content, the GuardianR package can be used. This package provides an interface to the Content API of the Guardian Media Group's Open Platform.

Similarly, the RNYTimes package provides broad access to New York Times web services, including article search, metadata, and user-generated content.

There are also R packages that provide a web scraping environment in R. This book focuses on the two best-known and most widely used of them: rvest and RSelenium.

rvest is a package that simplifies scraping data from HTML web pages, and it is inspired by libraries such as Beautiful Soup. It is designed to work with the magrittr package, so it is easy and practical to write scraping scripts that consist of simple, easy-to-understand parts.

Selenium is a web automation tool that was not originally developed for scraping. However, with Selenium, you can develop web scraping scripts. Because Selenium drives a real web browser, all content must be rendered in the browser, which can slow down the data collection process.

Headless browsers such as PhantomJS can speed up this process. The RSelenium package allows you to connect to a Selenium Server. RSelenium supports unit testing and regression testing of web apps and web pages across a variety of browsers and operating systems.


Author (1)

Olgun Aydin

Olgun Aydin is a PhD candidate at the Department of Statistics at Mimar Sinan University, and is studying deep learning for his thesis. He also works as a data scientist. Olgun is familiar with big data technologies, such as Hadoop and Spark, and is a very big fan of R. He has already published academic papers about the application of statistics, machine learning, and deep learning. He loves statistics, and loves to investigate new methods and share his experience with other people.