You're reading from Python Web Scraping - Second Edition (2nd Edition, Packt, May 2017, ISBN-13: 9781786462589)

Author: Katharine Jarmul

Katharine Jarmul is a data scientist and Pythonista based in Berlin, Germany. She runs a data science consulting company, Kjamistan, that provides services such as data extraction, acquisition, and modelling for small and large companies. She has been writing Python since 2008 and scraping the web with Python since 2010, and has worked at both small and large start-ups that use web scraping for data analysis and machine learning. When she's not scraping the web, you can follow her thoughts and activities via Twitter (@kjam).

Caching Downloads

In the previous chapter, we learned how to scrape data from crawled web pages and save the results to a CSV file. What if we now want to scrape an additional field, such as the flag URL? To scrape additional fields, we would need to download the entire website again. This is not a significant obstacle for our small example website; however, other websites can have millions of web pages, which could take weeks to recrawl. One way scrapers avoid these problems is by caching crawled web pages from the beginning, so they only need to be downloaded once.
In this chapter, we will cover a few ways to do this using our web crawler.

In this chapter, we will cover the following topics:

  • When to use caching
  • Adding cache support to the link crawler
  • Testing the cache
  • Using requests-cache
  • Redis cache implementation

When to use caching?

To cache, or not to cache? This is a question many programmers, data scientists, and web scrapers need to answer. In this chapter, we will show you how to use caching for your web crawlers; but should you use caching?

If you need to perform a large crawl that may be interrupted by an error or exception, caching saves you from having to recrawl all the pages you have already covered. Caching also lets you access those pages while offline (for your own data analysis or development purposes).

However, if having the most up-to-date and current information from the site is your highest priority, then caching might not make sense. In addition, if you don't plan large or repeated crawls, you might just want to scrape the page each time.

You may want to outline how often the pages you are scraping change or how often you should scrape new pages and clear...
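For example, one simple way to manage cache freshness is to record when each page was downloaded and treat entries older than a chosen interval as expired. The following is a minimal sketch of that idea; the has_expired helper and the 30-day default are illustrative assumptions, not the book's code:

from datetime import datetime, timedelta

def has_expired(timestamp, expires=timedelta(days=30)):
    """Return True if a page cached at `timestamp` is older than `expires`."""
    return datetime.utcnow() > timestamp + expires

A crawler can then call has_expired() before serving a cached page and re-download the page when the entry is stale.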

Disk Cache

To cache downloads, we will first try the obvious solution and save web pages to the filesystem. To do this, we will need a way to map URLs to a safe cross-platform filename. The following table lists limitations for some popular filesystems:

Operating system    Filesystem    Invalid filename characters        Maximum filename length
Linux               Ext3/Ext4     / and \0                           255 bytes
OS X                HFS Plus      : and \0                           255 UTF-16 code units
Windows             NTFS          \, /, ?, :, *, ", >, <, and |      255 characters

To keep our file path safe across these filesystems, we will restrict it to numbers, letters, and basic punctuation, and replace all other characters with an underscore, as shown in the following code:

>>> import re
>>> url = 'http://example.webscraping.com/default/view/Australia-1'
>>> re.sub(r'[^/0-9a-zA-Z\-.,;_ ]', '_', url)
'http_//example.webscraping.com/default/view/Australia-1'
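With this mapping in place, a disk cache can expose a dictionary-like interface, storing each downloaded result in a file derived from its URL. The following is a minimal sketch of that idea; the class name DiskCache, the cache directory, and the use of pickle for serialization are illustrative assumptions rather than the book's exact implementation:

import os
import re
import pickle

class DiskCache:
    """Dictionary-like cache that stores each downloaded page on disk."""
    def __init__(self, cache_dir='cache'):
        self.cache_dir = cache_dir

    def url_to_path(self, url):
        # Reuse the sanitising regex to build a safe, cross-platform filename
        filename = re.sub(r'[^/0-9a-zA-Z\-.,;_ ]', '_', url)
        # Trim each path segment to respect the 255-character limits above
        filename = '/'.join(segment[:255] for segment in filename.split('/'))
        return os.path.join(self.cache_dir, filename)

    def __getitem__(self, url):
        path = self.url_to_path(url)
        if not os.path.exists(path):
            raise KeyError(url + ' does not exist')
        with open(path, 'rb') as fp:
            return pickle.load(fp)

    def __setitem__(self, url, result):
        path = self.url_to_path(url)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, 'wb') as fp:
            pickle.dump(result, fp)

A crawler would then write cache[url] = result after each download and try cache[url] before issuing a new request.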

Key-value storage cache

To avoid the anticipated limitations of our disk-based cache, we will now build our cache on top of an existing key-value storage system. When crawling, we may need to cache massive amounts of data and will not need any complex joins, so we will use a high-availability key-value store, which is easier to scale than a traditional relational database or even most NoSQL databases. Specifically, our cache will use Redis, a very popular key-value store.

What is key-value storage?

Key-value storage is very similar to a Python dictionary, in that each element in the storage has a key and a value. When we designed the DiskCache, the key-value model lent itself well to the problem. Redis, in fact, stands for REmote DIctionary Server. Redis was...
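To make the dictionary analogy concrete, the same __getitem__/__setitem__ interface used for the disk cache can be backed by a Redis server via the redis-py client. The following is a minimal sketch under those assumptions; the class name RedisCache and the use of pickle for serialization are illustrative, not necessarily the book's implementation:

import pickle
from redis import StrictRedis

class RedisCache:
    """Dictionary-like cache backed by a Redis server."""
    def __init__(self, client=None):
        # Default to a Redis server running locally on the standard port
        self.client = client or StrictRedis(host='localhost', port=6379, db=0)

    def __getitem__(self, url):
        record = self.client.get(url)
        if record is None:
            raise KeyError(url + ' does not exist')
        return pickle.loads(record)

    def __setitem__(self, url, result):
        self.client.set(url, pickle.dumps(result))

Because Redis keeps its data in memory and handles persistence and replication itself, this design sidesteps the filename and disk-space concerns of the disk-based cache.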

Summary

In this chapter, we learned that caching downloaded web pages saves time and minimizes bandwidth when recrawling a website. However, caching pages takes up disk space, which can be partly alleviated through compression. Additionally, building on top of an existing storage system, such as Redis, can be useful to avoid speed, memory, and filesystem limitations.

In the next chapter, we will add further functionalities to our crawler so we can download web pages concurrently and crawl the web even faster.
