
Chapter 3. Caching Downloads

In the preceding chapter, we learned how to scrape data from crawled web pages and save the results to a spreadsheet. What if we now want to scrape an additional field, such as the flag URL? To scrape additional fields, we would need to download the entire website again. This is not a significant obstacle for our small example website. However, other websites can have millions of web pages that would take weeks to recrawl. The solution presented in this chapter is to cache all the crawled web pages so that they only need to be downloaded once.

Adding cache support to the link crawler


To support caching, the download function developed in Chapter 1, Introduction to Web Scraping, needs to be modified to check the cache before downloading a URL. We also need to move throttling inside this function, so that we throttle only when a download is actually made and not when a page is loaded from the cache. To avoid passing various parameters on every download, we will take this opportunity to refactor the download function into a class, so that parameters can be set once in the constructor and reused multiple times. Here is the updated implementation to support this:

class Downloader:
    def __init__(self, delay=5, user_agent='wswp', proxies=None, num_retries=1, cache=None):
        # set the parameters once here so they do not need to be passed on every download
        self.throttle = Throttle(delay)
        self.user_agent = user_agent
        self.proxies = proxies
        self.num_retries = num_retries
        self.cache = cache

    def __call__(self, url):
        # check the cache before downloading this URL
        result = None
        if self.cache:
            try:
                ...
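
To make the intended behaviour concrete (consult the cache first, and throttle only when an actual download is made), here is a rough sketch of that logic as a standalone function; the fetch name, the download callable, and the dict-like cache interface are assumptions for illustration rather than the book's exact code.

# A rough sketch of the caching behaviour described above (illustrative only).
# `cache` is assumed to act like a dict keyed by URL, `throttle.wait(url)`
# enforces the crawl delay, and `download(url)` performs the real HTTP request.
def fetch(url, cache, throttle, download):
    result = None
    if cache is not None:
        try:
            result = cache[url]      # cache hit: no download and no throttling
        except KeyError:
            pass                     # this URL has not been cached yet
    if result is None:
        throttle.wait(url)           # throttle only when a real download happens
        result = download(url)
        if cache is not None:
            cache[url] = result      # store the result for future crawls
    return result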

Disk cache


To cache downloads, we will first try the obvious solution and save web pages to the filesystem. To do this, we will need a way to map URLs to a safe cross-platform filename. The following table lists the limitations for some popular filesystems:

Operating system   File system   Invalid filename characters      Maximum filename length
Linux              Ext3/Ext4     / and \0                         255 bytes
OS X               HFS Plus      : and \0                         255 UTF-16 code units
Windows            NTFS          \, /, ?, :, *, ", >, <, and |    255 characters
To keep our file paths safe across these filesystems, we need to restrict them to numbers, letters, and basic punctuation, replacing all other characters with an underscore, as shown in the following code:

>>> import re
>>> url = 'http://example.webscraping.com/default/view/Australia-1'
>>> re.sub(r'[^/0-9a-zA-Z\-.,;_ ]', '_', url)
'http_//example.webscraping.com/default/view/Australia-1'

Additionally, the filename and the parent directories need to be restricted to 255 characters...
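
Putting these two constraints together, a small helper along the following lines could map a URL to a cache path; the url_to_path name, the cache_dir default, and the overall structure are illustrative assumptions rather than the book's exact implementation.

import os
import re


def url_to_path(url, cache_dir='cache'):
    # sketch of mapping a URL to a safe, cross-platform cache path
    # replace characters that are invalid on some filesystems with an underscore
    filename = re.sub(r'[^/0-9a-zA-Z\-.,;_ ]', '_', url)
    # restrict each path component (directories and the filename) to 255 characters
    filename = '/'.join(segment[:255] for segment in filename.split('/'))
    return os.path.join(cache_dir, filename)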

Database cache


To avoid the anticipated limitations of our disk-based cache, we will now build our cache on top of an existing database system. When crawling, we may need to cache massive amounts of data but will not need any complex joins, so we will use a NoSQL database, which is easier to scale than a traditional relational database. Specifically, our cache will use MongoDB, which is currently the most popular NoSQL database.
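
As a preview, a MongoDB-backed cache exposing the same dictionary-style interface might look roughly like the sketch below; it assumes pymongo is installed and a MongoDB server is running locally, and the MongoCache name, collection, and field names are illustrative rather than the book's exact schema.

from pymongo import MongoClient


class MongoCache:
    # sketch of a MongoDB-backed cache with a dict-like interface (illustrative)
    def __init__(self, client=None):
        # connect to a local MongoDB server unless an existing client is supplied
        self.client = MongoClient('localhost', 27017) if client is None else client
        self.db = self.client.cache

    def __getitem__(self, url):
        record = self.db.webpage.find_one({'_id': url})
        if record:
            return record['result']
        raise KeyError(url + ' does not exist')

    def __setitem__(self, url, result):
        # insert or update the cached result for this URL
        self.db.webpage.update_one({'_id': url}, {'$set': {'result': result}}, upsert=True)

Because such an object supports __getitem__ and __setitem__, it could be passed to the Downloader class as its cache argument without any other changes.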

What is NoSQL?

NoSQL stands for Not Only SQL and is a relatively new approach to database design. The traditional relational model uses a fixed schema and splits the data into tables. However, with large datasets, the data can become too big for a single server and needs to be scaled across multiple servers. This does not fit well with the relational model because, when querying multiple tables, the data will not necessarily be available on the same server. NoSQL databases, on the other hand, are generally schemaless and designed from the start to shard seamlessly across...

Summary


In this chapter, we learned that caching downloaded web pages saves time and minimizes bandwidth when recrawling a website. The main drawback is that the cache takes up disk space, although this can be reduced through compression. Additionally, building the cache on top of an existing database system, such as MongoDB, avoids filesystem limitations.
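
As a concrete illustration of the compression point, cached results could be serialized and compressed before being written to disk, roughly as in the sketch below; the function names and the pickle plus zlib combination are assumptions for illustration.

import pickle
import zlib


def save_result(path, result):
    # serialize the result and compress it before writing it to disk
    with open(path, 'wb') as fp:
        fp.write(zlib.compress(pickle.dumps(result)))


def load_result(path):
    # reverse the steps: decompress, then deserialize
    with open(path, 'rb') as fp:
        return pickle.loads(zlib.decompress(fp.read()))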

In the next chapter, we will add further functionality to our crawler so that we can download multiple web pages concurrently and crawl faster.
