Preface

This book is for R programmers looking to quickly get started with web scraping. Some fundamental knowledge of R is required. This book will give you a quick, hands-on introduction to web scraping and to popular R libraries such as rvest and RSelenium. From the initial environment setup to scraping HTML web pages for useful information, this book covers only the fundamentals of web scraping, without going into too much depth. By the end of the book, you will have the understanding necessary to scrape web pages using R.

Who this book is for

This book is for R programmers looking to quickly get started with web scraping, as well as data analysts who want to learn about scraping using R. Some fundamental knowledge of R is all that is required to get started with this book.

What this book covers

Chapter 1, Introduction to Web Scraping, introduces web scraping techniques, which are becoming more and more popular now that data is as valuable as oil in the 21st century. In this chapter, you can find detailed information about web scraping technologies. We also give an overview of some of the key languages for web scraping, such as XPath and regex, and look at some web scraping libraries for R, such as rvest and RSelenium.

Chapter 2, Working with the XML Path Language and the Regular Expression Language, looks at XPath and regex rules, which are important to know when scraping a web page. In this chapter, you can find useful information about these languages and also have a chance to write XPath and regex rules from scratch.

Chapter 3, Web Scraping with rvest, covers the rvest library. Scraping a web page with R is straightforward thanks to the rvest library, which was developed by Hadley Wickham. In this chapter, you can find tips and tricks about the library and learn how to write an R script from scratch that uses rvest to scrape a web page.

Chapter 4, Web Scraping with RSelenium, explores RSelenium. RSelenium provides R bindings for Selenium, a browser automation tool built for testing that is also useful for scraping web pages. In this chapter, you can find an overview of Selenium and learn how to scrape a web page using the RSelenium library.

Chapter 5, Storing Data and Creating Cronjobs, deals with the matter of storage. After collecting data, you should store the dataset somewhere; it would be good if you could use a cloud-based solution, such as AWS RDS, EC2, Google Cloud Platform, or Microsoft Azure. Also, if you would like to schedule the collection of data, it's possible to create a cronjob that will help you do so. In this chapter, you can find an overview of databases and cloud platforms, and you'll also learn how to connect to databases and schedule cronjobs using R.

To get the most out of this book

To get the most out of this book, it's important that you have an idea of what web scraping is. It is also advised that you have good hands-on experience with R programming.
To get started, you need R and RStudio ready on your PC. Information about all the packages required for scraping data is given within the chapters.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

Log in or register at www.packt.com.
Select the SUPPORT tab.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/R-Web-Scraping-Quick-Start-Guide. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/9781789138733_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."

A block of code is set as follows:

<publisher> (root element node)
<author>J Olgun Aydin</author> (element node)
lang="en" (attribute node)

Any command-line input or output is written as follows:

install.packages("RPostgreSQL")

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."

Warnings or important notes appear like this.

Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at customercare@packtpub.com.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions; we at Packt can understand what you think about our products; and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.