Scraping Challenges and Solutions

In this chapter, we will cover:

  • Retrying failed page downloads
  • Supporting page redirects
  • Waiting for content to be available in Selenium
  • Limiting crawling to a single domain
  • Processing infinitely scrolling pages
  • Controlling the depth of a crawl
  • Controlling the length of a crawl
  • Handling paginated websites
  • Handling forms and forms-based authorization
  • Handling basic authorization
  • Preventing bans by scraping via proxies
  • Randomizing user agents
  • Caching responses

Introduction

Developing a reliable scraper is never easy; there are so many what-ifs that we need to take into account. What if the website goes down? What if the response returns unexpected data? What if your IP is throttled or blocked? What if authentication is required? While we can never predict and cover every what-if, we will discuss some common traps, challenges, and workarounds.

Note that several of the recipes require access to a website that I have provided as a Docker container, as they need more logic than the simple, static site we used in earlier chapters can offer. Therefore, you will need to pull and run the container using the following Docker commands:

docker pull mheydt/pywebscrapecookbook
docker run -p 5001:5001 mheydt/pywebscrapecookbook

Retrying failed page downloads

Failed page requests can be easily handled by Scrapy using retry middleware. This middleware is enabled by default; when it is active, Scrapy will retry any request that fails with one of the following HTTP error codes:

[500, 502, 503, 504, 408]

The process can be further configured using the following parameters:

  • RETRY_ENABLED (True/False - default is True)
  • RETRY_TIMES (the maximum number of retries for a failed request - default is 2)
  • RETRY_HTTP_CODES (a list of HTTP error codes that should trigger a retry - default is [500, 502, 503, 504, 408])

How to do it

The 06/01_scrapy_retry.py script demonstrates how to configure Scrapy for retries. The script file contains the following configuration for Scrapy:

process = CrawlerProcess({
    'LOG_LEVEL': '...
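
A minimal, self-contained sketch of such a configuration follows. The spider and its start URL (httpbin.org/status/500, an endpoint that always answers with HTTP 500 and therefore triggers the retry logic) are illustrative stand-ins rather than the book's exact script; the settings themselves are standard Scrapy settings:

import scrapy
from scrapy.crawler import CrawlerProcess


class RetryDemoSpider(scrapy.Spider):
    # stand-in spider: this URL always returns HTTP 500, so every request is retried
    name = 'retry_demo'
    start_urls = ['http://httpbin.org/status/500']

    def parse(self, response):
        print(response.status, response.url)


if __name__ == "__main__":
    process = CrawlerProcess({
        'LOG_LEVEL': 'DEBUG',
        'RETRY_ENABLED': True,                          # on by default, shown for clarity
        'RETRY_TIMES': 3,                               # retry each failed request up to three times
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 408]   # the default list of retryable codes
    })
    process.crawl(RetryDemoSpider)
    process.start()

Running it with LOG_LEVEL set to DEBUG makes the retry attempts visible in the log output.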

Supporting page redirects

Page redirects in Scrapy are handled using redirect middleware, which is enabled by default. The process can be further configured using the following parameters:

  • REDIRECT_ENABLED: (True/False - default is True)
  • REDIRECT_MAX_TIMES: (The maximum number of redirections to follow for any single request - default is 20)

How to do it

The script in 06/02_scrapy_redirects.py demonstrates how to configure Scrapy to handle redirects. It configures a maximum of two redirects for any page. Running the script reads the NASA sitemap and crawls that content, which contains a large number of redirects, many of them from HTTP to HTTPS versions of URLs. There will be a lot of output, but here are...
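
As a rough sketch of that kind of configuration (the sitemap spider and its URL are assumptions standing in for the book's script), the redirect settings are passed to the CrawlerProcess just like the retry settings:

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import SitemapSpider


class NasaSitemapSpider(SitemapSpider):
    # stand-in spider; the sitemap URL is an assumption about the script's target
    name = 'nasa_redirects'
    sitemap_urls = ['https://www.nasa.gov/sitemap.xml']

    def parse(self, response):
        print(response.status, response.url)


if __name__ == "__main__":
    process = CrawlerProcess({
        'LOG_LEVEL': 'INFO',
        'REDIRECT_ENABLED': True,   # on by default, shown for clarity
        'REDIRECT_MAX_TIMES': 2     # give up after two redirects for any single request
    })
    process.crawl(NasaSitemapSpider)
    process.start()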

Waiting for content to be available in Selenium

A common problem with dynamic web pages is that even after the whole page has loaded, and hence Selenium's get() method has returned, there may still be content we need to access later, because outstanding Ajax requests from the page are still pending completion. An example of this is needing to click a button that is not enabled until all of the data has been loaded into the page asynchronously.

Take the following page as an example: http://the-internet.herokuapp.com/dynamic_loading/2. This page finishes loading very quickly and presents us with a Start button:

The Start button presented on screen

When pressing the button, we are presented with a progress bar for five seconds:

The status bar while waiting

And when this is completed, we are presented with Hello World!

After the page is completely...
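
A minimal sketch of waiting for that result with Selenium's explicit waits is shown below. It assumes Chrome with a matching chromedriver on the PATH, and the element IDs it uses (start, finish) are assumptions about how that demo page is built:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("http://the-internet.herokuapp.com/dynamic_loading/2")

    # press the Start button, which kicks off the asynchronous load
    driver.find_element(By.CSS_SELECTOR, "#start button").click()

    # block until the result element becomes visible, or fail after 15 seconds
    finish = WebDriverWait(driver, 15).until(
        EC.visibility_of_element_located((By.ID, "finish")))
    print(finish.text)   # Hello World!
finally:
    driver.quit()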

Limiting crawling to a single domain

We can inform Scrapy to limit the crawl to only pages within a specified set of domains. This is an important task, as links can point to anywhere on the web, and we often want to control where crawls end up going. Scrapy makes this very easy to do. All that needs to be done is setting the allowed_domains field of your scraper class.

How to do it

The code for this example is 06/04_allowed_domains.py. You can run the script with your Python interpreter. It will execute and generate a ton of output, but if you keep an eye on it, you will see that it only processes pages on nasa.gov.
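
A minimal sketch of the idea looks like the following; the spider itself is illustrative (the real script's start URLs and parsing logic may differ), but allowed_domains is the standard Scrapy attribute:

import scrapy
from scrapy.crawler import CrawlerProcess


class NasaOnlySpider(scrapy.Spider):
    # illustrative spider; only requests to nasa.gov (and its subdomains) are allowed
    name = 'nasa_only'
    allowed_domains = ['nasa.gov']
    start_urls = ['https://www.nasa.gov/']

    def parse(self, response):
        print(response.url)
        for href in response.xpath('//a/@href').extract():
            # links pointing off-domain are silently dropped by the offsite middleware
            yield response.follow(href, callback=self.parse)


if __name__ == "__main__":
    process = CrawlerProcess({'LOG_LEVEL': 'INFO'})
    process.crawl(NasaOnlySpider)
    process.start()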

How it works...

Processing infinitely scrolling pages

Many websites have replaced "previous/next" pagination buttons with an infinite scrolling mechanism. These websites use this technique to load more data when the user has reached the bottom of the page. Because of this, strategies for crawling by following the "next page" link fall apart.

While this would seem to be a case for using browser automation to simulate the scrolling, it's actually quite easy to figure out the page's Ajax requests and crawl those instead of the rendered page itself. Let's look at spidyquotes.herokuapp.com/scroll as an example.
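
As a quick sketch of the approach (the API URL pattern and the JSON field names quotes, has_next, author, and text are assumptions about that demo site's endpoint), the data can be pulled page by page with plain HTTP requests:

import requests

# the scrolling page fetches its data from a JSON endpoint, one "page" at a time
api_url = "http://spidyquotes.herokuapp.com/api/quotes?page={}"

page = 1
while True:
    data = requests.get(api_url.format(page)).json()
    for quote in data["quotes"]:
        print(quote["author"]["name"], ":", quote["text"])
    if not data.get("has_next"):
        break
    page += 1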

Getting ready

Controlling the depth of a crawl

The depth of a crawl can be controlled using Scrapy's DepthMiddleware. The depth middleware limits the number of link hops that Scrapy will follow from any given starting page. This option is useful for controlling how deep you go into a particular crawl and for keeping a crawl from going on too long, and it is handy when you know that the content you are crawling for is located within a certain number of degrees of separation from the pages at the start of your crawl.

How to do it

The depth control middleware is installed in the middleware pipeline by default. An example of depth limiting is contained in the 06/06_limit_depth.py script. This script crawls the static site provided with...
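
The setting that drives the middleware is DEPTH_LIMIT. A minimal sketch of using it follows; the spider and its start URL (a placeholder for the container site) are illustrative assumptions:

import scrapy
from scrapy.crawler import CrawlerProcess


class DepthDemoSpider(scrapy.Spider):
    # illustrative spider; the start URL is a placeholder for the container site
    name = 'depth_demo'
    start_urls = ['http://localhost:5001/']

    def parse(self, response):
        # the depth middleware records each request's depth in its meta data
        print(response.meta.get('depth', 0), response.url)
        for href in response.xpath('//a/@href').extract():
            yield response.follow(href, callback=self.parse)


if __name__ == "__main__":
    process = CrawlerProcess({
        'LOG_LEVEL': 'INFO',
        'DEPTH_LIMIT': 2   # drop requests more than two link hops from the start page
    })
    process.crawl(DepthDemoSpider)
    process.start()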

Controlling the length of a crawl

The length of a crawl, in terms of the number of pages that can be parsed, can be controlled with the CLOSESPIDER_PAGECOUNT setting.

How to do it

We will be using the script in 06/07_limit_length.py. The script and scraper are the same as the NASA sitemap crawler with the addition of the following configuration to limit the number of pages parsed to 5:

if __name__ == "__main__":
    process = CrawlerProcess({
        'LOG_LEVEL': 'INFO',
        'CLOSESPIDER_PAGECOUNT': 5
    })
    process.crawl(Spider)
    process.start()

When this is run, the following output will be generated (interspersed in the logging output):

<200 https://www.nasa.gov/exploration...

Handling paginated websites

Pagination breaks large sets of content into a number of pages. Normally, these pages have a previous/next page link for the user to click. These links can generally be found with XPath or other means and then followed to get to the next page (or previous). Let's examine how to traverse across pages with Scrapy. We'll look at a hypothetical example of crawling the results of an automated internet search. The techniques directly apply to many commercial sites with search capabilities, and are easily modified for those situations.
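
A minimal sketch of the pattern in Scrapy is shown below; the start URL and both XPath expressions are assumptions about a generic results page rather than the book's exact selectors:

import scrapy
from scrapy.crawler import CrawlerProcess


class PaginatedSpider(scrapy.Spider):
    # illustrative spider; the start URL is a placeholder for the container site
    name = 'paginated_demo'
    start_urls = ['http://localhost:5001/']

    def parse(self, response):
        # scrape the items on the current results page
        for title in response.xpath('//h3/a/text()').extract():
            yield {'title': title.strip()}

        # find the "next page" link and follow it until there isn't one
        next_href = response.xpath('//a[contains(., "Next")]/@href').extract_first()
        if next_href:
            yield response.follow(next_href, callback=self.parse)


if __name__ == "__main__":
    process = CrawlerProcess({'LOG_LEVEL': 'INFO'})
    process.crawl(PaginatedSpider)
    process.start()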

Getting ready

We will demonstrate handling pagination with an example that crawls a set of pages from the website in the provided container. This website models...

Handling forms and forms-based authorization

We are often required to log into a site before we can crawl its content. This is usually done through a form where we enter a username and password, press Enter, and are then granted access to previously hidden content. This type of form authentication is often called cookie authorization, because when we authenticate, the server creates a cookie that it can use to verify that we have signed in. Scrapy respects these cookies, so all we need to do is somehow automate the form submission during our crawl.
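
A minimal sketch of automating such a form with Scrapy's FormRequest follows; the form field names and credentials are placeholders, not the container site's actual values:

import scrapy
from scrapy.crawler import CrawlerProcess


class LoginSpider(scrapy.Spider):
    # illustrative spider; field names and credentials are assumptions
    name = 'login_demo'
    start_urls = ['http://localhost:5001/home/secured']

    def parse(self, response):
        # the secured page presents a login form; fill it in and submit it
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'someuser', 'password': 'somepass'},
            callback=self.after_login)

    def after_login(self, response):
        # the session cookie set by the server is now sent automatically
        self.logger.info("Logged in, landed on %s", response.url)


if __name__ == "__main__":
    process = CrawlerProcess({'LOG_LEVEL': 'INFO'})
    process.crawl(LoginSpider)
    process.start()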

Getting ready

We will crawl a page in the container's website at the following URL: http://localhost:5001/home/secured. On this page, and on pages linked from it, there is content we would like to scrape...

Handling basic authorization

Some websites use a form of authorization known as basic authorization. It was popular before other means of authorization, such as cookie auth or OAuth, and it is still common on corporate intranets and some web APIs. In basic authorization, a header is added to the HTTP request. This header, Authorization, is given the string Basic followed by a base64 encoding of the value <username>:<password>. So in the case of darkhelmet, this header would look as follows:

Authorization: Basic ZGFya2hlbG1ldDp2ZXNwYQ==, with ZGFya2hlbG1ldDp2ZXNwYQ== being darkhelmet:vespa base64 encoded.
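
Building that header in Python takes only a couple of lines. The httpbin.org URL below is just a convenient public endpoint that demands those exact credentials; it is not part of the book's examples:

import base64
import requests

# encode the username:password pair exactly as the Authorization header expects
credentials = base64.b64encode(b"darkhelmet:vespa").decode("ascii")
print(credentials)   # ZGFya2hlbG1ldDp2ZXNwYQ==

headers = {"Authorization": "Basic " + credentials}
response = requests.get("http://httpbin.org/basic-auth/darkhelmet/vespa", headers=headers)
print(response.status_code)   # 200 when the credentials are accepted

Most HTTP libraries will also build this header for you; with requests, passing auth=('darkhelmet', 'vespa') has the same effect.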

Note that this is no more secure than sending the credentials in plain text (although when performed over HTTPS, the exchange is at least protected in transit). However, for the most part, it has been superseded by more robust forms of authorization, and even cookie authorization allows for more complex features such...

Preventing bans by scraping via proxies

Sometimes you may get blocked by a site that you are scraping because you have been identified as a scraper, and sometimes this happens because the webmaster sees the scrape requests all coming from a single IP address, at which point they simply block access to that IP.

To help prevent this problem, it is possible to use proxy randomization middleware within Scrapy. There exists a library, scrapy-proxies, which implements a proxy randomization feature.
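
As a sketch of how that library is wired up (the setting names follow the scrapy-proxies project's documentation, so verify them against the version you install; the proxy list path is a placeholder), the relevant Scrapy settings look roughly like this:

# settings fragment for scrapy-proxies (pip install scrapy_proxies)

# retry generously, since individual proxies fail often
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# a local file with one proxy per line, for example http://user:pass@host:port
PROXY_LIST = '/path/to/proxy/list.txt'

# 0 means: pick a different random proxy for every request
PROXY_MODE = 0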

Getting ready

How to do it

Randomizing user agents

Which user agent you use can have an effect on the success of your scraper. Some websites will flat out refuse to serve content to specific user agents. This can be because the user agent identifies a scraper that has been banned, or because the user agent belongs to an unsupported browser (notably Internet Explorer 6).

Another reason to control the user agent is that the web server may render content differently depending on the user agent specified. This is currently common for mobile sites, but it can also be used for desktop browsers, to do things such as delivering simpler content to older browsers.

Therefore, it can be useful to set the user agent to a value other than the default. Scrapy defaults to a user agent named scrapybot. This can be configured by using the BOT_NAME parameter. If you use Scrapy projects, Scrapy will set the agent to the name of your...
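
One straightforward way to randomize the agent, shown here as a sketch rather than as this recipe's exact approach, is a small downloader middleware that overwrites the User-Agent header on every request; the agent strings and the module path in the final comment are placeholders:

import random

# a small pool of real browser user agent strings (shortened for illustration)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13) AppleWebKit/604.5.6 (KHTML, like Gecko) Version/11.0 Safari/604.5.6',
    'Mozilla/5.0 (X11; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0',
]


class RandomUserAgentMiddleware(object):
    """Downloader middleware that assigns a random user agent to every request."""

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)


# enabled by pointing DOWNLOADER_MIDDLEWARES at the class, for example:
# DOWNLOADER_MIDDLEWARES = {'mybot.middlewares.RandomUserAgentMiddleware': 400}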

Caching responses

Scrapy comes with the ability to cache HTTP requests. This can greatly reduce crawling times if pages have already been visited. By enabling the cache, Scrapy will store every request and response.

How to do it

There is a working example in the 06/10_file_cache.py script. In Scrapy, caching middleware is disabled by default. To enable this cache, set HTTPCACHE_ENABLED to True and HTTPCACHE_DIR to a directory on the file system (using a relative path will create the directory in the project's data folder). To demonstrate, this script runs a crawl of the NASA site, and caches the content. It is configured using the following:

if __name__ == "__main__":
    process = CrawlerProcess({
        '...
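
A complete, minimal sketch of that configuration follows; the sitemap spider and its URL stand in for the book's NASA crawler, while HTTPCACHE_ENABLED and HTTPCACHE_DIR are the standard Scrapy cache settings:

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import SitemapSpider


class Spider(SitemapSpider):
    # stand-in for the NASA sitemap crawler; the sitemap URL is an assumption
    name = 'nasa_cached'
    sitemap_urls = ['https://www.nasa.gov/sitemap.xml']

    def parse(self, response):
        print(response.url)


if __name__ == "__main__":
    process = CrawlerProcess({
        'LOG_LEVEL': 'INFO',
        'HTTPCACHE_ENABLED': True,     # turn on the HTTP cache middleware
        'HTTPCACHE_DIR': 'httpcache'   # a relative path is created under the project's data folder
    })
    process.crawl(Spider)
    process.start()

Running the script a second time should finish noticeably faster, since responses are served from the cache directory instead of being downloaded again.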