You're reading from  Go Web Scraping Quick Start Guide

Product type: Book
Published in: Jan 2019
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789615708
Edition: 1st Edition
Author: Vincent Smith

Vincent Smith has been a software engineer for 10 years, having worked in fields ranging from health and IT to machine learning and large-scale web scrapers. He has worked for both Fortune 500 companies and start-ups, and has sharpened his skills with the best of both worlds. While obtaining a degree in electrical engineering, he learned the foundations of writing good code through his Java courses. These basics helped spur his move into software development early in his professional career, in order to provide support for his team. He fell in love with the process of teaching computers how to behave, which set him on the path he still walks today.

Protecting Your Web Scraper

Now that you have built a web scraper that is capable of autonomously collecting information from various websites, you should take a few important measures to make sure it operates safely. As you should be aware, nothing on the internet should be fully trusted unless you have complete ownership of it.

In this chapter, we will discuss the following tools and techniques you will need to ensure your web scraper's safety:

  • Virtual private servers
  • Proxies
  • Virtual private networks
  • Whitelists and blacklists

Virtual private servers

When you make an HTTP request to a website, you are making a direct connection between your computer and the target server. By doing this, you are providing that server with your machine's public IP address, which can be used to determine your general location and your Internet Service Provider (ISP). Although this can't be tied directly back to your exact location, it could be used maliciously if it finds its way into the wrong hands. With this in mind, it is preferable not to expose any of your personal assets to untrusted servers.

Running your web scraper on a computer that is far removed from your physical location, with some sort of remote access, is a good way to decouple your web scraper from your personal computer. You can rent Virtual Private Server (VPS) instances from various providers on the web.

Some of the more notable companies include...

Proxies

The role of a proxy is to provide an additional layer of protection on top of your system. At its core, a proxy is a server that sits between your web scraper and the target web server and passes communication between the two. Your web scraper sends a request to the proxy server, which then forwards the request to the website. From the point of view of the website, the request appears to come only from the proxy server, with no knowledge of where it originated. There are many types of proxy available, each with its own pros and cons.

Public and shared proxies

Some proxies are open to the public to use. However, they can be shared by many different people. This jeopardizes your reliability because, if other users...

Virtual private networks

Depending on your requirements, you may need to connect to a Virtual Private Network (VPN) to ensure that all of your web scraping traffic is hidden. Where proxies provide a layer of protection by masking the IP address of your web scraper, a VPN also masks the data that flows between your scraper and the target site by sending it through an encrypted tunnel. This makes the content that you are scraping invisible to ISPs and anyone else with access to your network.

VPNs are not legal in all countries. Please comply with local laws.

There are many companies that offer VPN access, with costs typically ranging from $5 to $15 per month.

Some recommended companies are listed as follows:

  • Vypr VPN
  • Express VPN
  • IPVanish VPN
  • Nord VPN

Configuring your web scraper to use a VPN differs from configuring it to use a proxy. VPNs usually require a specific client to connect your machine...

Boundaries

When you are crawling a website, you may not always know where you will end up. Many links in web pages take you to external sites that you may not trust as much as your target sites. These linked pages could contain irrelevant information or could be used for malicious purposes. It is important to define boundaries for your web scraper to safely navigate through unknown sources.

Whitelists

Whitelisting domains is a process by means of which you explicitly allow your scraper to access certain websites. Any site listed on the whitelist is OK for the web scraper to access, whereas any site that is not listed is automatically skipped. This is a simple way to ensure that your scraper only accesses pages for a small...
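
The whitelist check described above can be sketched in Go with the standard library's net/url package. This is an illustration under stated assumptions rather than the book's implementation: the isAllowed helper, the whitelist variable, and the example.com hostnames are all hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// whitelist holds the only hostnames the scraper may visit.
// The hostnames are placeholders chosen for illustration.
var whitelist = map[string]bool{
	"example.com":     true,
	"www.example.com": true,
}

// isAllowed (a hypothetical helper) reports whether rawURL points at a
// whitelisted host. URLs that fail to parse, have no host, or name a host
// that is not listed are rejected.
func isAllowed(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	// Hostname strips any port; hostnames are compared case-insensitively.
	return whitelist[strings.ToLower(u.Hostname())]
}

func main() {
	links := []string{
		"https://example.com/products",
		"https://www.example.com/about",
		"https://untrusted.example.org/login",
	}
	for _, link := range links {
		if !isAllowed(link) {
			fmt.Println("skip: ", link)
			continue
		}
		fmt.Println("visit:", link)
	}
}
```

Calling a check like this before every request keeps the crawl inside the listed hosts, no matter which links a page contains.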

Summary

In this chapter, we reviewed a number of different techniques to ensure that we and our web scrapers are protected while browsing the internet. By using a VPS, we protect our personal assets from malicious activity and discoverability on the internet. Proxies also help restrict information about the source of internet traffic, providing a layer of anonymity. VPNs add an extra layer of security over proxies by creating an encrypted tunnel for our data to flow through. Finally, creating whitelists and blacklists ensures that our scraper will not venture too deep into uncharted and undesirable places.

In Chapter 7, Scraping with Concurrency, we will look at how to use concurrency in order to increase the scale of our web scraper without the added cost of incorporating extra resources.
