Scraping Data from the Web with Beautiful Soup 4

In the previous chapter, we wrote a piece of code that communicates with the Nominatim web service in order to collect information. Frequently, however, there is no API in place, and the data may be scattered throughout hundreds of web pages or, even worse, locked in files with a complex structure (PDFs). In this chapter, we'll explore another data collection path—scraping raw HTML pages. In order to do so, we will use another library, Beautiful Soup 4, which can parse raw HTML files into objects and help us sift through them, extracting bits of information. Using this tool, we will collect a relatively large dataset of historic battles of World War II, which we will process, clean, and analyze in the chapters to come.

In this chapter, we will cover the following topics:

  • When there is no API
  • Scraping WWII battles
  • Beyond Beautiful Soup

Technical requirements

In this chapter, we'll make use of the requests and BeautifulSoup libraries—both are included in the Anaconda distribution. If you don't use Anaconda, make sure you have them both installed. Given that you will scrape data from the web, an internet connection is also required. As usual, the code for this chapter is stored in the Chapter07 folder of the GitHub repository, https://github.com/PacktPublishing/Learn-Python-by-Building-Data-Science-Applications.
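
The following is a minimal, illustrative check (not code from the book) that confirms both libraries are importable; the install commands in the comments are the usual ones, but adjust them to your environment:

# Quick sanity check that the libraries are available.
# If an import fails, install the missing package, for example:
#   conda install requests beautifulsoup4   (with Anaconda)
#   pip install requests beautifulsoup4     (with plain Python)
import requests
import bs4

print("requests:", requests.__version__)
print("beautifulsoup4:", bs4.__version__)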

When there is no API

As with API services, web pages have their owners, and they may or may not be open to the idea of scraping their data. If there is an API in place, this is always preferred over scraping, for the following reasons:

  • First, an API is usually much better and simpler to use, and there is at least some guarantee that its owners will retain its structure, or will let you know of upcoming changes in advance. With HTML web pages, there is no guarantee whatsoever; websites change often and won't tell you ahead of time, so expect plenty of emergency breaking changes!
  • Second, serving raw data is substantially cheaper, computation-wise, than serving a full-blown HTML page, so by using the API you act as a good citizen, and the service owners will be thankful.
  • Lastly, some data (for example, historic changes) will not be available via the web page.

However, there are plenty of examples...

Scraping WWII battles

The goal of this chapter is to collect information on all the battles of WWII from Wikipedia. A corresponding list is provided at https://en.wikipedia.org/wiki/List_of_World_War_II_battles. As you can see, it contains links to a large set of pages, one for each battle, operation, and campaign. Furthermore, the list is structured: battles are grouped by campaign or operation, which are, in turn, grouped by theater – it would be great to preserve this hierarchy! Most elements of the list also have a date. We'll work with those lists in a minute.
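
To make the plan concrete, here is a rough, illustrative sketch (not the book's exact code) of fetching the list page and pairing each battle link with the theater heading above it; the CSS class and the dictionary keys are assumptions about the page's markup and may need adjusting:

import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_World_War_II_battles"

# Download the raw HTML and parse it into a soup object.
response = requests.get(URL)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# The article body keeps theaters as headings and battles as nested lists;
# here we simply pair every link with the most recent heading above it.
# The class name and tag choices are assumptions about the current markup.
content = soup.find("div", class_="mw-parser-output")
links, current_theater = [], None
for element in content.find_all(["h2", "a"]):
    if element.name == "h2":
        current_theater = element.get_text(strip=True)
    elif element.get("href", "").startswith("/wiki/"):
        links.append({
            "theater": current_theater,
            "title": element.get_text(strip=True),
            "url": "https://en.wikipedia.org" + element["href"],
        })

print(len(links), "links collected")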

Now, if you check a couple of pages for specific battles, you may notice that they have a similar structure. For most of them, the large information card on the right has a set of similar subsections, including the main section with dates, locations, and outcomes, and a few additional...
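
As an illustration of how such a card could be read (a sketch only; the infobox class name and the use of th/td rows are assumptions about Wikipedia's markup, not the book's exact code), the card can be treated as a table of label/value rows:

import requests
from bs4 import BeautifulSoup

def parse_infobox(url):
    """Return the label/value pairs from a battle page's information card."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # Wikipedia renders the card as a table; the class name is an assumption
    # about the current markup and may change.
    card = soup.find("table", class_="infobox")
    info = {}
    if card is None:
        return info
    for row in card.find_all("tr"):
        label, value = row.find("th"), row.find("td")
        if label and value:
            info[label.get_text(" ", strip=True)] = value.get_text(" ", strip=True)
    return info

# Example usage with one battle page (any page from the list would do):
print(parse_infobox("https://en.wikipedia.org/wiki/Battle_of_Stalingrad"))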

Beyond Beautiful Soup

In this example, we used the BS4 library to parse static HTML for us. Beautiful Soup is an invaluable library for dealing with occasionally messy HTML, but when it comes to large scale and dynamic pages, it simply won't suffice. For production scraping in large quantities, perhaps on a regular basis, it is a good idea to use the Scrapy (https://scrapy.org/) package. Scrapy is an entire framework for downloading HTML, parsing and extracting data, and then storing it. One of its killer features is that it can run asynchronously – for example, while it is waiting for one page to load, it can switch to processing another, automatically. Because of that, Scrapy's scrapers are significantly faster on large lists of websites. At the same time, its interface is more expressive for the developer, as it is explicitly designed for scraping.
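
For a taste of what that looks like, here is a minimal, illustrative Scrapy spider (the spider name, selector, and output fields are assumptions for demonstration, not code from this book):

import scrapy

class BattleLinksSpider(scrapy.Spider):
    """Collect battle links from the Wikipedia list page."""
    name = "battle_links"
    start_urls = [
        "https://en.wikipedia.org/wiki/List_of_World_War_II_battles",
    ]

    def parse(self, response):
        # Scrapy schedules requests asynchronously, so many pages can be
        # in flight at once; here we only yield link items from one page.
        for link in response.css("div.mw-parser-output a::attr(href)").getall():
            if link.startswith("/wiki/"):
                yield {"url": response.urljoin(link)}

# Run from the command line, for example:
#   scrapy runspider battle_links_spider.py -o battles.json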

Depending...

Summary

In this chapter, we did the hard work of scraping data from HTML pages using the Beautiful Soup 4 library. With it, we were able to collect all the links from one page, preserving their hierarchy, and then retrieve the information for each of the collected links. This skill is invaluable, as it allows you to collect information from the internet, whether for research, business, or a personal hobby.

We also touched on Selenium, which emulates a full-blown browser; it can interact with the page and execute JavaScript, giving us access to content beyond static HTML.
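
For reference, a minimal, illustrative Selenium snippet (not covered by this chapter's requirements; it needs the selenium package and a matching browser driver installed separately) looks like this:

# Illustrative only: load a page in a real browser and read the rendered HTML.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://en.wikipedia.org/wiki/List_of_World_War_II_battles")
html_after_javascript = driver.page_source  # HTML after scripts have run
driver.quit()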

In the next chapter, we'll clean and use the data we collected, creating an interactive visualization of the war.

Questions

  1. What does the term web scraping mean in this context?
  2. What are the biggest differences between scraping and using an API? What are the challenges?
  3. What exactly does Beautiful Soup do? Can we scrape without it?
  4. Why did we use recursion here?
  5. Should we clean data during scraping?
  6. What is the right approach to dealing with missing data or broken links?

Further reading
