You're reading from Go Web Scraping Quick Start Guide

Product type: Book

Published in: Jan 2019

Reading level: Intermediate

Publisher: Packt

ISBN-13: 9781789615708

Edition: 1st Edition

Author (1)

Vincent Smith

Protecting Your Web Scraper

Now that you have built a web scraper that can autonomously collect information from various websites, you should take a few measures to make sure it operates safely. As you should be aware, nothing on the internet should be fully trusted unless you have complete ownership of it.

In this chapter, we will discuss the following tools and techniques you will need to ensure your web scraper's safety:

Virtual private servers
Proxies
Virtual private networks
Whitelists and blacklists

Virtual private servers

When you make an HTTP request to a website, you make a direct connection between your computer and the target server. In doing so, you give that server your machine's public IP address, which can be used to determine your general location and your Internet Service Provider (ISP). Although this can't be tied directly to your exact location, it could be used maliciously if it finds its way into the wrong hands. With this in mind, it is preferable not to expose any of your personal assets to untrusted servers.

Running your web scraper on a computer that is far removed from your physical location, with some sort of remote access, is a good way to decouple your web scraper from your personal computer. You can rent Virtual Private Server (VPS) instances from various providers on the web.

Some of the more notable companies include...

Proxies

The role of a proxy is to provide an additional layer of protection on top of your system. At its core, a proxy is a server that sits between your web scraper and the target web server and passes communication between the two. Your web scraper sends a request to the proxy server, which then forwards the request to the website. From the website's point of view, the request comes only from the proxy server, and it has no knowledge of where the request originated. There are many types of proxy available, each with its own pros and cons.
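As a minimal sketch of the request flow described above, the following Go program builds an HTTP client whose requests are routed through a proxy, using the standard library's http.Transport and http.ProxyURL. The proxy address http://127.0.0.1:8080 and the helper names newProxyClient and proxyFor are assumptions made for illustration and do not come from the book. The program only reports which proxy would be used; it does not send any traffic.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// newProxyClient returns an HTTP client that sends every request through
// the proxy at proxyAddr. The address is a placeholder chosen by the caller.
func newProxyClient(proxyAddr string) (*http.Client, error) {
	proxyURL, err := url.Parse(proxyAddr)
	if err != nil {
		return nil, fmt.Errorf("parsing proxy address: %w", err)
	}
	if proxyURL.Host == "" {
		return nil, fmt.Errorf("proxy address %q has no host", proxyAddr)
	}
	return &http.Client{
		// http.ProxyURL makes the transport use the same proxy for all requests.
		Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
		Timeout:   30 * time.Second,
	}, nil
}

// proxyFor reports which proxy the client would use for the target URL,
// without sending any network traffic. It returns "" if no proxy applies.
func proxyFor(client *http.Client, target string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return "", err
	}
	transport, ok := client.Transport.(*http.Transport)
	if !ok || transport.Proxy == nil {
		return "", nil
	}
	u, err := transport.Proxy(req)
	if err != nil || u == nil {
		return "", err
	}
	return u.String(), nil
}

func main() {
	// Hypothetical local proxy address; replace it with a proxy you control.
	client, err := newProxyClient("http://127.0.0.1:8080")
	if err != nil {
		panic(err)
	}
	via, err := proxyFor(client, "https://example.com/")
	if err != nil {
		panic(err)
	}
	fmt.Println("requests to example.com go via:", via)
}
```

Calling client.Get on the returned client would then send the request to the proxy, which forwards it to the website. The sketch sets a client timeout because a slow or unresponsive proxy would otherwise stall the scraper indefinitely.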

Public and shared proxies

Some proxies are open to the public to use. However, they can be shared by many different people. This jeopardizes your reliability because, if other users...

Virtual private networks

Depending on your needs, you may want to connect to a Virtual Private Network (VPN) to ensure that all of your web scraping traffic is hidden. Where proxies provide a layer of protection by masking the IP address of your web scraper, a VPN also sends the data that flows between your scraper and the target site through an encrypted tunnel. This makes the content you are scraping invisible to your ISP and to anyone else with access to your network.

VPNs are not legal in all countries. Please comply with local laws.

There are many companies that offer VPN access, with costs typically ranging from $5 to $15 per month.

Some recommended companies are listed as follows:

Vypr VPN
Express VPN
IPVanish VPN
Nord VPN

Configuring your web scraper to use a VPN differs from configuring it to use a proxy. VPNs usually require a specific client to connect your machine...

Boundaries

When you are crawling a website, you may not always know where you will end up. Many links in web pages take you to external sites that you may not trust as much as your target sites. These linked pages could contain irrelevant information or could be used for malicious purposes. It is important to define boundaries for your web scraper to safely navigate through unknown sources.
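One simple boundary is to detect when a link leaves the site the scraper started on. The sketch below is an illustration rather than the book's own code: the function name isExternal and the example URLs are assumptions. It resolves each link against the page it was found on and compares hostnames, using only the standard net/url package.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// isExternal reports whether link, found on the page at pageURL, points to a
// different host than the page itself. Relative links are resolved against
// the page URL first. Links without a host (such as mailto:) count as external.
func isExternal(pageURL, link string) (bool, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return false, err
	}
	ref, err := url.Parse(link)
	if err != nil {
		return false, err
	}
	target := base.ResolveReference(ref)
	// Hostnames are case-insensitive, so compare them with EqualFold.
	return !strings.EqualFold(target.Hostname(), base.Hostname()), nil
}

func main() {
	page := "https://example.com/blog/" // hypothetical page being crawled
	links := []string{
		"/about",
		"post-1.html",
		"https://other.example.org/x",
		"//cdn.example.net/a.js",
	}
	for _, l := range links {
		external, err := isExternal(page, l)
		if err != nil {
			fmt.Println("skipping unparseable link:", l)
			continue
		}
		fmt.Printf("%-30s external=%v\n", l, external)
	}
}
```

A crawler could use this check to decide whether a link is followed directly or passed to a stricter rule before it is visited.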

Whitelists

Whitelisting domains is a process in which you explicitly allow your scraper to access certain websites. Any site on the whitelist is OK for the web scraper to access, whereas any site that is not listed is automatically skipped. This is a simple way to ensure that your scraper only accesses pages for a small...
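The whitelist check described above can be sketched as a set of allowed hostnames that every URL is tested against before the scraper visits it. This is a minimal illustration under stated assumptions: the type whitelist, the helpers newWhitelist and allows, and the example domains are hypothetical names, not code from the book.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// whitelist is the set of hostnames the scraper is allowed to visit.
type whitelist map[string]struct{}

// newWhitelist builds a whitelist from the given hostnames, lowercased so
// that lookups are case-insensitive.
func newWhitelist(hosts ...string) whitelist {
	w := make(whitelist, len(hosts))
	for _, h := range hosts {
		w[strings.ToLower(h)] = struct{}{}
	}
	return w
}

// allows reports whether rawURL's hostname is on the whitelist.
// Unparseable URLs and URLs without a hostname are rejected.
func (w whitelist) allows(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil || u.Hostname() == "" {
		return false
	}
	_, ok := w[strings.ToLower(u.Hostname())]
	return ok
}

func main() {
	// Hypothetical domains; list the sites your scraper is meant to visit.
	allowed := newWhitelist("example.com", "docs.example.com")

	links := []string{
		"https://example.com/products",
		"https://docs.example.com/guide",
		"https://untrusted.example.net/offer",
	}
	for _, l := range links {
		if !allowed.allows(l) {
			fmt.Println("skipping:", l)
			continue
		}
		fmt.Println("visiting:", l)
	}
}
```

The match here is exact, so a subdomain such as www.example.com is skipped unless it is listed explicitly. That is a deliberate choice in this sketch: anything not named is treated as untrusted.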

Summary

In this chapter, we reviewed a number of different techniques to ensure that we and our web scrapers are protected while browsing the internet. By using a VPS, we protect our personal assets from malicious activity and discoverability on the internet. Proxies also help restrict information about the source of internet traffic, providing a layer of anonymity. VPNs add an extra layer of security over proxies by creating an encrypted tunnel for our data to flow through. Finally, creating whitelists and blacklists ensures that our scrapers will not venture too deep into uncharted and undesirable places.

In Chapter 7, Scraping with Concurrency, we will look at how to use concurrency in order to increase the scale of our web scraper without the added cost of incorporating extra resources.

Author (1)

Vincent Smith

Vincent Smith has been a software engineer for 10 years, having worked in various fields from health and IT to machine learning and large-scale web scrapers. He has worked for both large Fortune 500 companies and start-ups, and has sharpened his skills with the best of both worlds. While obtaining a degree in electrical engineering, he learned the foundations of writing good code through his Java courses. These basics helped spur his move into software development early in his professional career in order to provide support for his team. He fell in love with the process of teaching computers how to behave, which set him on the path he still walks today.