Python Web Scraping Cookbook

Scraping - Code of Conduct

In this chapter, we will cover:

  • Scraping legality and scraping politely
  • Respecting robots.txt
  • Crawling using the sitemap
  • Crawling with delays
  • Using identifiable user agents
  • Setting the number of concurrent requests per domain
  • Using auto throttling
  • Caching responses

Introduction

While you can technically scrape any website, it is important to know whether doing so is legal. We will discuss the legal concerns around scraping, explore general rules of thumb, and look at best practices for scraping politely and minimizing potential damage to the target websites.

Scraping legality and scraping politely

There's no real code in this recipe. It's simply an exposition of some of the concepts related to the legal issues involved in scraping. I'm not a lawyer, so don't take anything I write here as legal advice. I'll just point out a few things you need to be concerned with when using a scraper.

Getting ready

The legality of scraping breaks down into two issues:

  • Ownership of content
  • Denial of service

Fundamentally, anything posted on the web is open for reading. Every time you load a page, any page, your browser downloads that content from the web server and visually presents it to you. So in a sense, you and your browser are already scraping anything you look...

Respecting robots.txt

Many sites want to be crawled. It is inherent in the nature of the beast: web hosts put content on their sites to be seen by humans. But it is also important that other computers see the content. A great example is search engine optimization (SEO). SEO is a process where you actually design your site to be crawled by spiders such as Google's, so you are encouraging scraping. But at the same time, a publisher may only want specific parts of their site crawled, and may want to tell crawlers to keep their spiders off certain portions of the site, either because those portions are not for sharing, or because they are not important enough to be crawled and waste web server resources.

The rules for what you are and are not allowed to crawl are usually contained in a file found on most sites, known as robots.txt. robots.txt is a human-readable but machine-parsable file, which can be used to identify...
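As a minimal sketch of how a crawler can honor these rules (the example.com URLs and the user-agent string are placeholders, not from the book), the standard library's urllib.robotparser can check a URL against robots.txt before you fetch it:

# Minimal sketch: consult robots.txt before requesting a page.
# The example.com URLs and the 'MyCompany-MyCrawler' name are placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

if rp.can_fetch('MyCompany-MyCrawler', 'https://example.com/some/page.html'):
    print('Allowed to crawl this page')
else:
    print('Disallowed by robots.txt')

In Scrapy, setting ROBOTSTXT_OBEY to True in the crawler settings has the framework download and respect each site's robots.txt automatically.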

Crawling using the sitemap

A sitemap is a protocol that allows a webmaster to inform search engines about the URLs on a website that are available for crawling. Webmasters use it because they want their information to be crawled by search engines; they want that content to be findable, at least through search. But you can also use this information to your advantage.

A sitemap lists the URLs on a site and allows a webmaster to specify additional information about each URL (a short parsing sketch follows the lists below):

  • When it was last updated
  • How often the content changes
  • How important the URL is in relation to others

Sitemaps are useful on websites where:

  • Some areas of the website are not available through the browsable interface; that is, you cannot reach those pages
  • Ajax, Silverlight, or Flash content is used but not normally processed by search engines
  • The site...
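As a rough sketch of how this information can be read (this is not the book's code; the sitemap URL is a placeholder and the requests library is assumed), you can fetch sitemap.xml and pull out the URLs along with the optional metadata listed above. Scrapy also provides a SitemapSpider class that handles this for you.

# Rough sketch: download a sitemap and list the URLs it advertises.
# https://example.com/sitemap.xml is a placeholder location.
import requests
import xml.etree.ElementTree as ET

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

resp = requests.get('https://example.com/sitemap.xml')
root = ET.fromstring(resp.content)

for url in root.findall('sm:url', NS):
    loc = url.findtext('sm:loc', namespaces=NS)
    lastmod = url.findtext('sm:lastmod', namespaces=NS)        # when it was last updated
    changefreq = url.findtext('sm:changefreq', namespaces=NS)  # how often the content changes
    priority = url.findtext('sm:priority', namespaces=NS)      # relative importance
    print(loc, lastmod, changefreq, priority)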

Crawling with delays

Fast scraping is considered bad practice. Continuously pounding a website for pages can burn up its CPU and bandwidth, and a robust site will identify you doing this and block your IP. If you are unlucky, you might also get a nasty letter for violating its terms of service!

The technique of delaying requests in your crawler depends on how your crawler is implemented. If you are using Scrapy, you can set a parameter that tells the crawler how long to wait between requests. In a simple crawler that just sequentially processes URLs in a list, you can insert a time.sleep call.
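As a hedged sketch of both approaches (the five-second delay and the URLs are arbitrary placeholders): in current Scrapy versions the parameter is the DOWNLOAD_DELAY setting, and a simple sequential crawler can just sleep between requests.

import time
import requests
from scrapy.crawler import CrawlerProcess

# Scrapy: ask the downloader to wait roughly 5 seconds between requests
# (crawl/start calls omitted here)
process = CrawlerProcess({
    'DOWNLOAD_DELAY': 5
})

# Simple sequential crawler: sleep explicitly between requests
urls = ['https://example.com/page1', 'https://example.com/page2']  # placeholder URLs
for url in urls:
    requests.get(url)
    time.sleep(5)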

Things can get more complicated if you have implemented a distributed cluster of crawlers that spreads the load of page requests, such as using a message queue with competing consumers. That problem has a number of different solutions, which are beyond the scope of this context...

Using identifiable user agents

What happens if you violate the terms of service and get flagged by the website owner? How can you help the site owner contact you, so that they can nicely ask you to back off to what they consider a reasonable level of scraping?

What you can do to facilitate this is add information about yourself in the User-Agent header of your requests. We have seen an example of this in robots.txt files, such as the one from amazon.com, which contains explicit rules for Google's user agent, Googlebot.

During scraping, you can embed your own information within the User-Agent header of the HTTP requests. To be polite, you can enter something such as 'MyCompany-MyCrawler (mybot@mycompany.com)'. The remote server, if it flags you as being in violation, will certainly be capturing this information, and if it is provided like this, it gives them a convenient means...
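A minimal sketch of both styles (the company name, email address, and URL are placeholders): Scrapy lets you set this once through its USER_AGENT setting, and with the requests library you pass the header explicitly.

import requests
from scrapy.crawler import CrawlerProcess

CONTACT_UA = 'MyCompany-MyCrawler (mybot@mycompany.com)'  # placeholder identity

# Scrapy: send the identifying User-Agent on every request the crawler makes
process = CrawlerProcess({
    'USER_AGENT': CONTACT_UA
})

# requests: set the header explicitly on an individual request
resp = requests.get('https://example.com/', headers={'User-Agent': CONTACT_UA})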

Setting the number of concurrent requests per domain

It is generally inefficient to crawl a site one URL at a time. Therefore, a number of simultaneous page requests are normally made to the target site at any given time. Normally, the remote web server can quite effectively handle multiple simultaneous requests, and on your end you are just waiting for data to come back for each of them, so concurrency generally works well for your scraper.

But this is also a pattern that smart websites can identify and flag as suspicious activity. And there are practical limits on both your crawler's end and the website's. The more concurrent requests that are made, the more memory, CPU, network connections, and network bandwidth are required on both sides. These have costs involved, and there are practical limits on these values too.

So it is generally a good practice to set a limit on the...
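In Scrapy, this limit is exposed through the CONCURRENT_REQUESTS_PER_DOMAIN setting, alongside the global CONCURRENT_REQUESTS cap; a minimal sketch follows (the values are illustrative, not recommendations).

from scrapy.crawler import CrawlerProcess

# At most 16 requests in flight overall, and at most 2 to any single domain
process = CrawlerProcess({
    'CONCURRENT_REQUESTS': 16,
    'CONCURRENT_REQUESTS_PER_DOMAIN': 2
})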

Using auto throttling

Fairly closely tied to controlling the maximum level of concurrency is the concept of throttling. Websites vary in their ability to handle requests, both across multiple websites and on a single website at different times. During periods of slower response times, it makes sense to lighten up on the number of requests during that time. This can be a tedious process to monitor and adjust by hand.

Fortunately for us, Scrapy also provides the ability to do this via an extension named AutoThrottle.

How to do it

AutoThrottle can easily be configured by enabling the extension and setting AUTOTHROTTLE_TARGET_CONCURRENCY:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'AUTOTHROTTLE_ENABLED': True,            # the extension is disabled by default
    'AUTOTHROTTLE_TARGET_CONCURRENCY': 3
})
process.crawl(Spider)                        # Spider refers to a spider class defined earlier (not shown here)
process.start()
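Note that the target concurrency only has an effect when the extension itself is enabled, as in the snippet above. Once running, AutoThrottle adjusts the delay between requests based on observed response latencies, and in current Scrapy versions the related AUTOTHROTTLE_START_DELAY and AUTOTHROTTLE_MAX_DELAY settings bound how far it will slow down.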

Using an HTTP cache for development

The development of a web crawler is a process of exploration, one that iterates through various refinements to retrieve the requested information. During the development process, you will often be hitting remote servers, and the same URLs on those servers, over and over. This is not polite. Fortunately, Scrapy also comes to the rescue by providing caching middleware that is specifically designed to help in this situation.

How to do it

Scrapy will cache requests using a middleware module named HttpCacheMiddleware. Enabling it is as simple as configuring the HTTPCACHE_ENABLED setting to True:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'HTTPCACHE_ENABLED': True                # cache responses locally and reuse them
})
process.crawl(Spider)                        # Spider refers to a spider class defined earlier (not shown here)
process.start()
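In current Scrapy versions, the cached responses are stored on disk under an httpcache directory in the local project data folder, and HTTPCACHE_EXPIRATION_SECS controls how long they are considered fresh; its default of 0 means cached responses never expire, which is usually what you want while iterating during development.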