In the preceding chapter, we learned how to scrape data from crawled web pages and save the results to a spreadsheet. What if we now want to scrape an additional field, such as the flag URL? To scrape additional fields, we would need to download the entire website again. This is not a significant obstacle for our small example website. However, other websites can have millions of web pages that would take weeks to recrawl. The solution presented in this chapter is to cache all the crawled web pages so that they only need to be downloaded once.
You're reading from Web Scraping with Python
To support caching, the download function developed in Chapter 1, Introduction to Web Scraping, needs to be modified to check the cache before downloading a URL. We also need to move throttling inside this function and only throttle when a download is made, and not when loading from a cache. To avoid the need to pass various parameters for every download, we will take this opportunity to refactor the download function into a class, so that parameters can be set once in the constructor and reused multiple times. Here is the updated implementation to support this:
    class Downloader:
        def __init__(self, delay=5, user_agent='wswp', proxies=None, num_retries=1, cache=None):
            self.throttle = Throttle(delay)
            self.user_agent = user_agent
            self.proxies = proxies
            self.num_retries = num_retries
            self.cache = cache

        def __call__(self, url):
            result = None
            if self.cache:
                try:
                    ...
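The control flow described above (check the cache first, and throttle only when a real download happens) can be illustrated with a small standalone sketch. This is not the book's `Downloader`: the class name `CachingDownloader`, the injected `fetch` and `sleep` callables, and the single global delay (rather than a per-domain `Throttle`) are all assumptions made to keep the example self-contained.

```python
import time


class CachingDownloader:
    """Hypothetical sketch: consult a dict-like cache before downloading."""

    def __init__(self, fetch, delay=5, cache=None, sleep=time.sleep):
        self.fetch = fetch          # assumed callable: url -> page content
        self.delay = delay          # minimum seconds between real downloads
        self.cache = cache          # any object supporting cache[url] get/set
        self.sleep = sleep          # injectable so the delay can be observed
        self.last_download = None   # time of the most recent real download

    def __call__(self, url):
        if self.cache is not None:
            try:
                # Cache hit: return immediately, no throttle and no download
                return self.cache[url]
            except KeyError:
                pass  # URL has not been cached yet
        # Cache miss: throttle, download, then store the result
        self._throttle()
        result = self.fetch(url)
        if self.cache is not None:
            self.cache[url] = result
        return result

    def _throttle(self):
        # Simplification: one global delay instead of a per-domain one
        if self.last_download is not None:
            wait = self.delay - (time.monotonic() - self.last_download)
            if wait > 0:
                self.sleep(wait)
        self.last_download = time.monotonic()
```

Because the cache only needs to support `cache[url]` lookups and assignments, a plain `dict` works for experimenting, and a disk-backed or database-backed class with the same interface could be swapped in later.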
To cache downloads, we will first try the obvious solution and save web pages to the filesystem. To do this, we will need a way to map URLs to a safe cross-platform filename. The following table lists the limitations for some popular filesystems:
| Operating system | File system | Invalid filename characters | Maximum filename length |
|---|---|---|---|
| Linux | Ext3/Ext4 | / and \0 | 255 bytes |
| OS X | HFS Plus | : and \0 | 255 UTF-16 code units |
| Windows | NTFS | \, /, ?, :, *, ", >, <, and \| | 255 characters |
To keep our file path safe across these filesystems, we need to restrict it to numbers, letters, and basic punctuation, and replace all other characters with an underscore, as shown in the following code:
    >>> import re
    >>> url = 'http://example.webscraping.com/default/view/Australia-1'
    >>> re.sub(r'[^/0-9a-zA-Z\-.,;_ ]', '_', url)
    'http_//example.webscraping.com/default/view/Australia-1'
Additionally, the filename and the parent directories need to be restricted to 255 characters...
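Putting the character restriction and the length limit together, a URL-to-path helper might look like the following sketch. The function name `url_to_path`, the choice to drop the scheme, and the `index.html` default for URLs whose path is empty or ends in a slash are assumptions for illustration, not something the text above specifies.

```python
import re
from urllib.parse import urlsplit


def url_to_path(url, max_length=255):
    """Hypothetical helper: map a URL to a filesystem-safe relative path."""
    parts = urlsplit(url)
    path = parts.path
    if not path:
        # Assumption: give bare domains a default filename
        path = '/index.html'
    elif path.endswith('/'):
        # Assumption: a trailing slash would otherwise leave an empty filename
        path += 'index.html'
    filename = parts.netloc + path
    if parts.query:
        filename += '?' + parts.query
    # Replace every character outside the safe set with an underscore
    filename = re.sub(r'[^/0-9a-zA-Z\-.,;_ ]', '_', filename)
    # Truncate each directory and the filename to the maximum segment length
    return '/'.join(segment[:max_length] for segment in filename.split('/'))
```

After the substitution only ASCII characters remain, so a limit of 255 characters is also a limit of 255 bytes. Note that truncation and character replacement can map distinct URLs to the same path, which a real cache would need to tolerate or detect.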
To avoid the anticipated limitations of our disk-based cache, we will now build our cache on top of an existing database system. When crawling, we may need to cache massive amounts of data and will not need any complex joins, so we will use a NoSQL database, which is easier to scale than a traditional relational database. Specifically, our cache will use MongoDB, which is currently the most popular NoSQL database.
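As a rough idea of what such a cache could look like, here is a minimal sketch of a dict-like wrapper around a MongoDB collection. The class name `MongoCache`, the `result` and `timestamp` field names, and the use of the URL as the document `_id` are assumptions for illustration. The wrapper takes the collection as an argument so that it does not depend on a running server; with the `pymongo` driver, one could pass something like `pymongo.MongoClient('localhost', 27017).cache.webpage`, where the database name `cache` and collection name `webpage` are also just placeholders.

```python
from datetime import datetime, timezone


class MongoCache:
    """Hypothetical dict-like cache backed by a MongoDB collection."""

    def __init__(self, collection):
        # `collection` is assumed to behave like a pymongo Collection,
        # i.e. to provide find_one() and update_one()
        self.collection = collection

    def __getitem__(self, url):
        # The URL doubles as the primary key, so lookups use the _id index
        record = self.collection.find_one({'_id': url})
        if record is None:
            raise KeyError(url + ' does not exist')
        return record['result']

    def __setitem__(self, url, result):
        record = {'result': result, 'timestamp': datetime.now(timezone.utc)}
        # upsert=True inserts the document if the URL is new,
        # otherwise it updates the existing document in place
        self.collection.update_one({'_id': url}, {'$set': record}, upsert=True)
```

Using an upsert means that recrawling a URL overwrites the stored page instead of creating a duplicate record, and the stored timestamp leaves room for expiring stale entries later.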
NoSQL stands for Not Only SQL and is a relatively new approach to database design. The traditional relational model uses a fixed schema and splits the data into tables. However, with large datasets, the data is too big for a single server and needs to be scaled across multiple servers. This does not fit well with the relational model because, when querying multiple tables, the data will not necessarily be available on the same server. NoSQL databases, on the other hand, are generally schemaless and designed from the start to shard seamlessly across...
In this chapter, we learned that caching downloaded web pages saves time and reduces bandwidth usage when recrawling a website. The main drawback is that the cache takes up disk space, which can be reduced through compression. Additionally, building the cache on top of an existing database system, such as MongoDB, avoids any filesystem limitations.
In the next chapter, we will add further functionality to our crawler so that we can download multiple web pages concurrently and crawl faster.