You're reading from  R Web Scraping Quick Start Guide

Product type: Book
Published in: Oct 2018
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781789138733
Edition: 1st Edition
Author (1)
Olgun Aydin

Olgun Aydin is a PhD candidate at the Department of Statistics at Mimar Sinan University, and is studying deep learning for his thesis. He also works as a data scientist. Olgun is familiar with big data technologies, such as Hadoop and Spark, and is a very big fan of R. He has already published academic papers about the application of statistics, machine learning, and deep learning. He loves statistics, and loves to investigate new methods and share his experience with other people.

Storing Data and Creating Cronjob

Cloud computing is an information technology model that provides access to configurable system resources over the internet, at any time and from anywhere. These resources can be provisioned quickly and with minimal administrative overhead. Cloud computing is based on the principle of sharing resources.

Instead of spending resources on computer infrastructure and its maintenance, institutions that use cloud systems can concentrate those resources on their core business. Since the introduction of Amazon EC2 in 2006, high-capacity and high-speed networks, low-cost computers, and storage systems have become widely available. Hardware virtualization has also come into wide use.

In the 1960s, the first concepts of time-sharing became popular through Remote Job Entry (RJE). This terminology is often used...

Cloud engine models

Service-oriented architecture treats everything as a service (EaaS, XaaS, or simply aaS). Cloud computing providers offer their services according to different models: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). The following sections describe these models.

Infrastructure as a service (IaaS)

IaaS refers to online services that provide high-level APIs for abstracting away various low-level details of the underlying network infrastructure, such as physical computing resources, location, data partitioning, scaling, security, and backup.

Linux containers run in isolated partitions of a single Linux kernel running directly on physical hardware. Linux cgroups and namespaces are used to...

Some of the cloud services

In this part, we are going to talk about two well-known and widely used cloud services: Google Compute Engine and Amazon Web Services.

Amazon Web Services (AWS)

The AWS platform was launched in July 2002, initially consisting of only a few disparate tools and services. In late 2003, the AWS concept was reformulated when Chris Pinkham and Benjamin Black presented a paper describing a vision for a fully standardized, fully automated retail infrastructure that would rely extensively on web services for functions such as storage and retrieval.

They also raised the possibility of selling access to virtual servers as a service, proposing that the company could generate revenue from...

Cronjob

Cron is a job scheduler used in Unix-like operating systems. Developers can use cron for jobs that need to run regularly at specific times, dates, or intervals. In short, cron jobs are typically used to automate system maintenance or administration.

Cron is one of the most suitable tools for scheduling repetitive tasks. It is driven by a crontab (cron table) file, a configuration file that specifies shell commands to run periodically on a given schedule. Crontab files are stored where the job lists and other instructions for the cron daemon are kept. Users can have their own individual crontab files, and there is often a system-wide crontab file, usually in /etc or one of its subdirectories.

The syntax for each line is a cron expression consisting of five fields, followed by a shell command to execute.
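
As a minimal illustration of the five-field layout (minute, hour, day of month, month, day of week), the following R sketch assembles a single crontab line. The helper name `cron_line` and the script path `/path/to/scrape.R` are hypothetical and only serve the example; the sketch builds the text of an entry and does not install it.

```r
# Build one crontab line: five time fields followed by the command to run.
# "*" in a field means "every value" for that field.
cron_line <- function(command, minute = "*", hour = "*", day_of_month = "*",
                      month = "*", day_of_week = "*") {
  paste(minute, hour, day_of_month, month, day_of_week, command)
}

# Hypothetical job: run a scraping script every day at 06:30.
entry <- cron_line("Rscript /path/to/scrape.R", minute = "30", hour = "6")
print(entry)
# [1] "30 6 * * * Rscript /path/to/scrape.R"
```

Keeping the fields as named arguments makes it harder to swap, for example, the hour and the minute, which is a common mistake when writing crontab lines by hand.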

For example, assuming the following cron default shell is compatible...

Storing data and creating scheduled jobs for web scraping

In this part, we are going to create a free-tier AWS RDS instance and write a script that connects to this database by using the RPostgreSQL library. After writing this script, we will create a cronjob that automates web scraping and sends the data to the database at the scheduled time.
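
A rough sketch of what such a script might look like is given here, assuming the DBI and RPostgreSQL packages are installed. The environment variable names (`RDS_HOST`, `RDS_PORT`, and so on), the table name `scraped_data`, and both helper functions are hypothetical placeholders and are not taken from the book; the real endpoint and credentials come from your own RDS instance.

```r
# Hypothetical helper: read connection settings from environment variables
# so that credentials are not hard-coded in the script.
rds_settings <- function() {
  list(
    host     = Sys.getenv("RDS_HOST", "localhost"),
    port     = as.integer(Sys.getenv("RDS_PORT", "5432")),
    dbname   = Sys.getenv("RDS_DBNAME", "postgres"),
    user     = Sys.getenv("RDS_USER", "postgres"),
    password = Sys.getenv("RDS_PASSWORD", "")
  )
}

# Hypothetical helper: append a data frame of scraped rows to a table.
# The connection is closed when the function exits, even after an error.
store_scraped_data <- function(df, table = "scraped_data",
                               cfg = rds_settings()) {
  con <- DBI::dbConnect(
    RPostgreSQL::PostgreSQL(),
    host = cfg$host, port = cfg$port, dbname = cfg$dbname,
    user = cfg$user, password = cfg$password
  )
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbWriteTable(con, table, df, append = TRUE, row.names = FALSE)
}
```

A scheduled scraping script could then end with a call such as `store_scraped_data(results)`, where `results` is the data frame produced by the scraping step.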

Creating an AWS RDS Instance

Let's take a look at how to create a PostgreSQL database on AWS. In this section, we will walk through creating an AWS account and an RDS instance step by step:

  1. First, visit the AWS page by using the following URL: https://aws.amazon.com/
  2. Then click My Account to go to the login screen, as shown in the following screenshot:
AWS...

Summary

In this chapter, we focused on the fundamentals of cloud computing. We learned how to create AWS instances and how to connect to a PostgreSQL database hosted on AWS by using R. Finally, we focused on creating cronjobs with R. Now you are able to create an end-to-end web scraping system by using R.

In this book, we covered the main topics regarding web scraping with R. We talked about the main idea behind web scraping, learned how Regex rules work, and learned how to use and write XPath rules. We created our first web scraping script using the rvest library and even created a graph using the collected data.

Finally, we discussed Selenium. Using the RSelenium library, we created a web scraper on top of R. Then, we took a look at how to store the data we collected and how to schedule our web scraping tools with R.

So, now you are ready to build your...
