Scraping Challenges and Solutions

In this chapter, we will cover:

  • Retrying failed page downloads
  • Supporting page redirects
  • Waiting for content to be available in Selenium
  • Limiting crawling to a single domain
  • Processing infinitely scrolling pages
  • Controlling the depth of a crawl
  • Controlling the length of a crawl
  • Handling paginated websites
  • Handling forms and forms-based authorization
  • Handling basic authorization
  • Preventing bans by scraping via proxies
  • Randomizing user agents
  • Caching responses

Introduction

Developing a reliable scraper is never easy; there are so many what-ifs that we need to take into account. What if the website goes down? What if the response returns unexpected data? What if your IP is throttled or blocked? What if authentication is required? While we can never predict and cover every what-if, we will discuss some common traps, challenges, and workarounds.

Note that several of the recipes require access to a website that I have provided as a Docker container, as they need more logic than the simple, static site we used in earlier chapters can offer. Therefore, you will need to pull and run the container using the following Docker commands:

docker pull mheydt/pywebscrapecookbook
docker run -p 5001:5001 mheydt/pywebscrapecookbook

Retrying failed page downloads

Failed page requests can be easily handled by Scrapy using retry middleware. This middleware is enabled by default; when it is active, Scrapy will retry any request that fails with one of the following HTTP error codes:

[500, 502, 503, 504, 408]

The process can be further configured using the following parameters:

  • RETRY_ENABLED (True/False - default is True)
  • RETRY_TIMES (the maximum number of retries for a failed request - default is 2)
  • RETRY_HTTP_CODES (a list of HTTP error codes that should trigger a retry - default is [500, 502, 503, 504, 408])

How to do it

The 06/01_scrapy_retry.py script demonstrates how to configure Scrapy for retries. The script file contains the following configuration for Scrapy:

process = CrawlerProcess({
    'LOG_LEVEL': '...
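
A minimal, self-contained sketch of such a configuration follows. The spider and its start URL (httpbin.org/status/500, an endpoint that always answers with HTTP 500 and therefore triggers the retry logic) are illustrative stand-ins rather than the book's exact script; the settings themselves are standard Scrapy settings:

import scrapy
from scrapy.crawler import CrawlerProcess


class RetryDemoSpider(scrapy.Spider):
    # stand-in spider: this URL always returns HTTP 500, so every request is retried
    name = 'retry_demo'
    start_urls = ['http://httpbin.org/status/500']

    def parse(self, response):
        print(response.status, response.url)


if __name__ == "__main__":
    process = CrawlerProcess({
        'LOG_LEVEL': 'DEBUG',
        'RETRY_ENABLED': True,                          # on by default, shown for clarity
        'RETRY_TIMES': 3,                               # retry each failed request up to three times
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 408]   # the default list of retryable codes
    })
    process.crawl(RetryDemoSpider)
    process.start()

Running it with LOG_LEVEL set to DEBUG makes the retry attempts visible in the log output.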

Supporting page redirects

Page redirects in Scrapy are handled using redirect middleware, which is enabled by default. The process can be further configured using the following parameters:

  • REDIRECT_ENABLED: (True/False - default is True)
  • REDIRECT_MAX_TIMES: (The maximum number of redirections to follow for any single request - default is 20)

How to do it

The script in 06/02_scrapy_redirects.py demonstrates how to configure Scrapy to handle redirects. It configures a maximum of two redirects for any page. Running the script reads the NASA sitemap and crawls that content, which contains a large number of redirects, many of them from HTTP to HTTPS versions of URLs. There will be a lot of output, but here are...
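
As a rough sketch of that kind of configuration (the sitemap spider and its URL are assumptions standing in for the book's script), the redirect settings are passed to the CrawlerProcess just like the retry settings:

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import SitemapSpider


class NasaSitemapSpider(SitemapSpider):
    # stand-in spider; the sitemap URL is an assumption about the script's target
    name = 'nasa_redirects'
    sitemap_urls = ['https://www.nasa.gov/sitemap.xml']

    def parse(self, response):
        print(response.status, response.url)


if __name__ == "__main__":
    process = CrawlerProcess({
        'LOG_LEVEL': 'INFO',
        'REDIRECT_ENABLED': True,   # on by default, shown for clarity
        'REDIRECT_MAX_TIMES': 2     # give up after two redirects for any single request
    })
    process.crawl(NasaSitemapSpider)
    process.start()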

Waiting for content to be available in Selenium

A common problem with dynamic web pages is that even after the whole page has loaded, and hence Selenium's get() method has returned, there may still be content we need to access later, because outstanding Ajax requests from the page are still pending completion. An example of this is needing to click a button that is not enabled until all of the data has been loaded into the page asynchronously.

Take the following page as an example: http://the-internet.herokuapp.com/dynamic_loading/2. This page finishes loading very quickly and presents us with a Start button:

The Start button presented on screen

When pressing the button, we are presented with a progress bar for five seconds:

The status bar while waiting

And when this is completed, we are presented with Hello World!

After the page is completely...
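
A minimal sketch of waiting for that result with Selenium's explicit waits is shown below. It assumes Chrome with a matching chromedriver on the PATH, and the element IDs it uses (start, finish) are assumptions about how that demo page is built:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("http://the-internet.herokuapp.com/dynamic_loading/2")

    # press the Start button, which kicks off the asynchronous load
    driver.find_element(By.CSS_SELECTOR, "#start button").click()

    # block until the result element becomes visible, or fail after 15 seconds
    finish = WebDriverWait(driver, 15).until(
        EC.visibility_of_element_located((By.ID, "finish")))
    print(finish.text)   # Hello World!
finally:
    driver.quit()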

Limiting crawling to a single domain

We can inform Scrapy to limit the crawl to only pages within a specified set of domains. This is an important task, as links can point to anywhere on the web, and we often want to control where crawls end up going. Scrapy makes this very easy to do. All that needs to be done is setting the allowed_domains field of your scraper class.

How to do it

The code for this example is 06/04_allowed_domains.py. You can run the script with your Python interpreter. It will execute and generate a ton of output, but if you keep an eye on it, you will see that it only processes pages on nasa.gov.
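
A minimal sketch of the idea looks like the following; the spider itself is illustrative (the real script's start URLs and parsing logic may differ), but allowed_domains is the standard Scrapy attribute:

import scrapy
from scrapy.crawler import CrawlerProcess


class NasaOnlySpider(scrapy.Spider):
    # illustrative spider; only requests to nasa.gov (and its subdomains) are allowed
    name = 'nasa_only'
    allowed_domains = ['nasa.gov']
    start_urls = ['https://www.nasa.gov/']

    def parse(self, response):
        print(response.url)
        for href in response.xpath('//a/@href').extract():
            # links pointing off-domain are silently dropped by the offsite middleware
            yield response.follow(href, callback=self.parse)


if __name__ == "__main__":
    process = CrawlerProcess({'LOG_LEVEL': 'INFO'})
    process.crawl(NasaOnlySpider)
    process.start()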

How it works...

Processing infinitely scrolling pages

Many websites have replaced "previous/next" pagination buttons with an infinite scrolling mechanism. These websites use this technique to load more data when the user has reached the bottom of the page. Because of this, strategies for crawling by following the "next page" link fall apart.

While this would seem to be a case for using browser automation to simulate the scrolling, it's actually quite easy to figure out the page's Ajax requests and crawl those instead of the rendered page itself. Let's look at spidyquotes.herokuapp.com/scroll as an example.
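
As a quick sketch of the approach (the API URL pattern and the JSON field names quotes, has_next, author, and text are assumptions about that demo site's endpoint), the data can be pulled page by page with plain HTTP requests:

import requests

# the scrolling page fetches its data from a JSON endpoint, one "page" at a time
api_url = "http://spidyquotes.herokuapp.com/api/quotes?page={}"

page = 1
while True:
    data = requests.get(api_url.format(page)).json()
    for quote in data["quotes"]:
        print(quote["author"]["name"], ":", quote["text"])
    if not data.get("has_next"):
        break
    page += 1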

Getting ready

Controlling the depth of a crawl

The depth of a crawl can be controlled using Scrapy's DepthMiddleware. The depth middleware limits the number of link hops that Scrapy will follow from any given starting page. This option is useful for controlling how deep you go into a particular crawl and for keeping a crawl from going on too long, and it is handy when you know that the content you are crawling for is located within a certain number of degrees of separation from the pages at the start of your crawl.

How to do it

The depth control middleware is installed in the middleware pipeline by default. An example of depth limiting is contained in the 06/06_limit_depth.py script. This script crawls the static site provided with...
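
The setting that drives the middleware is DEPTH_LIMIT. A minimal sketch of using it follows; the spider and its start URL (a placeholder for the container site) are illustrative assumptions:

import scrapy
from scrapy.crawler import CrawlerProcess


class DepthDemoSpider(scrapy.Spider):
    # illustrative spider; the start URL is a placeholder for the container site
    name = 'depth_demo'
    start_urls = ['http://localhost:5001/']

    def parse(self, response):
        # the depth middleware records each request's depth in its meta data
        print(response.meta.get('depth', 0), response.url)
        for href in response.xpath('//a/@href').extract():
            yield response.follow(href, callback=self.parse)


if __name__ == "__main__":
    process = CrawlerProcess({
        'LOG_LEVEL': 'INFO',
        'DEPTH_LIMIT': 2   # drop requests more than two link hops from the start page
    })
    process.crawl(DepthDemoSpider)
    process.start()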

Controlling the length of a crawl

The length of a crawl, in terms of the number of pages that can be parsed, can be controlled with the CLOSESPIDER_PAGECOUNT setting.

How to do it

We will be using the script in 06/07_limit_length.py. The script and scraper are the same as the NASA sitemap crawler with the addition of the following configuration to limit the number of pages parsed to 5:

if __name__ == "__main__":
    process = CrawlerProcess({
        'LOG_LEVEL': 'INFO',
        'CLOSESPIDER_PAGECOUNT': 5
    })
    process.crawl(Spider)
    process.start()

When this is run, the following output will be generated (interspersed in the logging output):

<200 https://www.nasa.gov/exploration...

Handling paginated websites

Pagination breaks large sets of content into a number of pages. Normally, these pages have a previous/next page link for the user to click. These links can generally be found with XPath or other means and then followed to get to the next page (or previous). Let's examine how to traverse across pages with Scrapy. We'll look at a hypothetical example of crawling the results of an automated internet search. The techniques directly apply to many commercial sites with search capabilities, and are easily modified for those situations.
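
A minimal sketch of the pattern in Scrapy is shown below; the start URL and both XPath expressions are assumptions about a generic results page rather than the book's exact selectors:

import scrapy
from scrapy.crawler import CrawlerProcess


class PaginatedSpider(scrapy.Spider):
    # illustrative spider; the start URL is a placeholder for the container site
    name = 'paginated_demo'
    start_urls = ['http://localhost:5001/']

    def parse(self, response):
        # scrape the items on the current results page
        for title in response.xpath('//h3/a/text()').extract():
            yield {'title': title.strip()}

        # find the "next page" link and follow it until there isn't one
        next_href = response.xpath('//a[contains(., "Next")]/@href').extract_first()
        if next_href:
            yield response.follow(next_href, callback=self.parse)


if __name__ == "__main__":
    process = CrawlerProcess({'LOG_LEVEL': 'INFO'})
    process.crawl(PaginatedSpider)
    process.start()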

Getting ready

We will demonstrate handling pagination with an example that crawls a set of pages from the website in the provided container. This website models...

Handling forms and forms-based authorization

We are often required to log into a site before we can crawl its content. This is usually done through a form where we enter a username and password, press Enter, and are then granted access to previously hidden content. This type of form authentication is often called cookie authorization, because when we authenticate, the server creates a cookie that it can use to verify that we have signed in. Scrapy respects these cookies, so all we need to do is somehow automate the form submission during our crawl.
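
A minimal sketch of automating such a form with Scrapy's FormRequest follows; the form field names and credentials are placeholders, not the container site's actual values:

import scrapy
from scrapy.crawler import CrawlerProcess


class LoginSpider(scrapy.Spider):
    # illustrative spider; field names and credentials are assumptions
    name = 'login_demo'
    start_urls = ['http://localhost:5001/home/secured']

    def parse(self, response):
        # the secured page presents a login form; fill it in and submit it
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'someuser', 'password': 'somepass'},
            callback=self.after_login)

    def after_login(self, response):
        # the session cookie set by the server is now sent automatically
        self.logger.info("Logged in, landed on %s", response.url)


if __name__ == "__main__":
    process = CrawlerProcess({'LOG_LEVEL': 'INFO'})
    process.crawl(LoginSpider)
    process.start()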

Getting ready

We will crawl a page in the container's website at the following URL: http://localhost:5001/home/secured. On this page, and on pages linked from it, there is content we would like to scrape...

Handling basic authorization

Some websites use a form of authorization known as basic authorization. It was popular before other means of authorization, such as cookie auth or OAuth, and it is still common on corporate intranets and some web APIs. In basic authorization, a header is added to the HTTP request. This header, Authorization, is given the string Basic followed by a base64 encoding of the value <username>:<password>. So in the case of darkhelmet, this header would look as follows:

Authorization: Basic ZGFya2hlbG1ldDp2ZXNwYQ==, with ZGFya2hlbG1ldDp2ZXNwYQ== being darkhelmet:vespa base64 encoded.
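
Building that header in Python takes only a couple of lines. The httpbin.org URL below is just a convenient public endpoint that demands those exact credentials; it is not part of the book's examples:

import base64
import requests

# encode the username:password pair exactly as the Authorization header expects
credentials = base64.b64encode(b"darkhelmet:vespa").decode("ascii")
print(credentials)   # ZGFya2hlbG1ldDp2ZXNwYQ==

headers = {"Authorization": "Basic " + credentials}
response = requests.get("http://httpbin.org/basic-auth/darkhelmet/vespa", headers=headers)
print(response.status_code)   # 200 when the credentials are accepted

Most HTTP libraries will also build this header for you; with requests, passing auth=('darkhelmet', 'vespa') has the same effect.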

Note that this is no more secure than sending the credentials in plain text (although when performed over HTTPS, the exchange is at least protected in transit). However, for the most part, it has been superseded by more robust forms of authorization, and even cookie authorization allows for more complex features such...

Preventing bans by scraping via proxies

Sometimes you may get blocked by a site that you are scraping because you have been identified as a scraper, and sometimes this happens because the webmaster sees the scrape requests all coming from a single IP address, at which point they simply block access to that IP.

To help prevent this problem, it is possible to use proxy randomization middleware within Scrapy. There exists a library, scrapy-proxies, which implements a proxy randomization feature.
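
As a sketch of how that library is wired up (the setting names follow the scrapy-proxies project's documentation, so verify them against the version you install; the proxy list path is a placeholder), the relevant Scrapy settings look roughly like this:

# settings fragment for scrapy-proxies (pip install scrapy_proxies)

# retry generously, since individual proxies fail often
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# a local file with one proxy per line, for example http://user:pass@host:port
PROXY_LIST = '/path/to/proxy/list.txt'

# 0 means: pick a different random proxy for every request
PROXY_MODE = 0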

Getting ready

How to do it

Randomizing user agents

Which user agent you use can have an effect on the success of your scraper. Some websites will flat out refuse to serve content to specific user agents. This can be because the user agent identifies a scraper that has been banned, or because the user agent belongs to an unsupported browser (notably Internet Explorer 6).

Another reason to control the user agent is that the web server may render content differently depending on the user agent specified. This is currently common for mobile sites, but it can also be used for desktop browsers, to do things such as delivering simpler content to older browsers.

Therefore, it can be useful to set the user agent to a value other than the default. Scrapy defaults to a user agent named scrapybot. This can be configured by using the BOT_NAME parameter. If you use Scrapy projects, Scrapy will set the agent to the name of your...
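
One straightforward way to randomize the agent, shown here as a sketch rather than as this recipe's exact approach, is a small downloader middleware that overwrites the User-Agent header on every request; the agent strings and the module path in the final comment are placeholders:

import random

# a small pool of real browser user agent strings (shortened for illustration)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13) AppleWebKit/604.5.6 (KHTML, like Gecko) Version/11.0 Safari/604.5.6',
    'Mozilla/5.0 (X11; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0',
]


class RandomUserAgentMiddleware(object):
    """Downloader middleware that assigns a random user agent to every request."""

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)


# enabled by pointing DOWNLOADER_MIDDLEWARES at the class, for example:
# DOWNLOADER_MIDDLEWARES = {'mybot.middlewares.RandomUserAgentMiddleware': 400}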

Caching responses

Scrapy comes with the ability to cache HTTP requests. This can greatly reduce crawling times if pages have already been visited. By enabling the cache, Scrapy will store every request and response.

How to do it

There is a working example in the 06/10_file_cache.py script. In Scrapy, caching middleware is disabled by default. To enable this cache, set HTTPCACHE_ENABLED to True and HTTPCACHE_DIR to a directory on the file system (using a relative path will create the directory in the project's data folder). To demonstrate, this script runs a crawl of the NASA site, and caches the content. It is configured using the following:

if __name__ == "__main__":
    process = CrawlerProcess({
        '...
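
A complete, minimal sketch of that configuration follows; the sitemap spider and its URL stand in for the book's NASA crawler, while HTTPCACHE_ENABLED and HTTPCACHE_DIR are the standard Scrapy cache settings:

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import SitemapSpider


class Spider(SitemapSpider):
    # stand-in for the NASA sitemap crawler; the sitemap URL is an assumption
    name = 'nasa_cached'
    sitemap_urls = ['https://www.nasa.gov/sitemap.xml']

    def parse(self, response):
        print(response.url)


if __name__ == "__main__":
    process = CrawlerProcess({
        'LOG_LEVEL': 'INFO',
        'HTTPCACHE_ENABLED': True,     # turn on the HTTP cache middleware
        'HTTPCACHE_DIR': 'httpcache'   # a relative path is created under the project's data folder
    })
    process.crawl(Spider)
    process.start()

Running the script a second time should finish noticeably faster, since responses are served from the cache directory instead of being downloaded again.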