Scraping Data from the Web with Beautiful Soup 4

In the previous chapter, we wrote a piece of code that communicates with the Nominatim web service in order to collect information. Frequently, however, there is no API in place, and the data may be scattered throughout hundreds of web pages or, even worse, locked in files with a complex structure (PDFs). In this chapter, we'll explore another data collection path—scraping raw HTML pages. In order to do so, we will use another library, Beautiful Soup 4, which can parse raw HTML files into objects and help us sift through them, extracting bits of information. Using this tool, we will collect a relatively large dataset of historic battles of World War II, which we will process, clean, and analyze in the chapters to come.

In this chapter, we will cover the following topics:

  • When there is no API
  • Scraping WWII battles
  • Beyond Beautiful Soup

Technical requirements

In this chapter, we'll make use of the requests and BeautifulSoup libraries—both are included in the Anaconda distribution. If you don't use Anaconda, make sure you have them both installed. Given that you will scrape data from the web, an internet connection is also required. As usual, the code for this chapter is stored in the Chapter07 folder of the GitHub repository, https://github.com/PacktPublishing/Learn-Python-by-Building-Data-Science-Applications.
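
The following is a minimal, illustrative check (not code from the book) that confirms both libraries are importable; the install commands in the comments are the usual ones, but adjust them to your environment:

# Quick sanity check that the libraries are available.
# If an import fails, install the missing package, for example:
#   conda install requests beautifulsoup4   (with Anaconda)
#   pip install requests beautifulsoup4     (with plain Python)
import requests
import bs4

print("requests:", requests.__version__)
print("beautifulsoup4:", bs4.__version__)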

When there is no API

As with API services, web pages have their owners, and they may or may not be open to the idea of scraping their data. If there is an API in place, this is always preferred over scraping, for the following reasons:

  • First, an API is usually much better and simpler to use, and there is at least some guarantee that its owners will retain its structure, or will let you know of upcoming changes in advance. With HTML web pages, there is no guarantee whatsoever; websites change often and won't tell you ahead of time, so expect plenty of emergency breaking changes!
  • Second, serving raw data is substantially cheaper, computation-wise, than serving a full-blown HTML page, so by using the API you act as a good citizen, and the service owners will be thankful.
  • Lastly, some data (for example, historic changes) will not be available via the web page.

However, there are plenty of examples...

Scraping WWII battles

The goal of this chapter is to collect information on all the battles of WWII from Wikipedia. A corresponding list is provided at https://en.wikipedia.org/wiki/List_of_World_War_II_battles. As you can see, it contains links to a large set of pages, one for each battle, operation, and campaign. Furthermore, the list is structured: battles are grouped by campaign or operation, which are, in turn, grouped by theater – it would be great to preserve this hierarchy! Most elements of the list also have a date. We'll work with those lists in a minute.
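
To make the plan concrete, here is a rough, illustrative sketch (not the book's exact code) of fetching the list page and pairing each battle link with the theater heading above it; the CSS class and the dictionary keys are assumptions about the page's markup and may need adjusting:

import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_World_War_II_battles"

# Download the raw HTML and parse it into a soup object.
response = requests.get(URL)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# The article body keeps theaters as headings and battles as nested lists;
# here we simply pair every link with the most recent heading above it.
# The class name and tag choices are assumptions about the current markup.
content = soup.find("div", class_="mw-parser-output")
links, current_theater = [], None
for element in content.find_all(["h2", "a"]):
    if element.name == "h2":
        current_theater = element.get_text(strip=True)
    elif element.get("href", "").startswith("/wiki/"):
        links.append({
            "theater": current_theater,
            "title": element.get_text(strip=True),
            "url": "https://en.wikipedia.org" + element["href"],
        })

print(len(links), "links collected")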

Now, if you check a couple of pages for specific battles, you may notice that they have a similar structure. For most of them, the large information card on the right has a set of similar subsections, including the main section with dates, locations, and outcomes, and a few additional...
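
As an illustration of how such a card could be read (a sketch only; the infobox class name and the use of th/td rows are assumptions about Wikipedia's markup, not the book's exact code), the card can be treated as a table of label/value rows:

import requests
from bs4 import BeautifulSoup

def parse_infobox(url):
    """Return the label/value pairs from a battle page's information card."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # Wikipedia renders the card as a table; the class name is an assumption
    # about the current markup and may change.
    card = soup.find("table", class_="infobox")
    info = {}
    if card is None:
        return info
    for row in card.find_all("tr"):
        label, value = row.find("th"), row.find("td")
        if label and value:
            info[label.get_text(" ", strip=True)] = value.get_text(" ", strip=True)
    return info

# Example usage with one battle page (any page from the list would do):
print(parse_infobox("https://en.wikipedia.org/wiki/Battle_of_Stalingrad"))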

Beyond Beautiful Soup

In this example, we used the BS4 library to parse static HTML for us. Beautiful Soup is an invaluable library for dealing with occasionally messy HTML, but when it comes to large scale and dynamic pages, it simply won't suffice. For production scraping in large quantities, perhaps on a regular basis, it is a good idea to use the Scrapy (https://scrapy.org/) package. Scrapy is an entire framework for downloading HTML, parsing and extracting data, and then storing it. One of its killer features is that it can run asynchronously – for example, while it is waiting for one page to load, it can switch to processing another, automatically. Because of that, Scrapy's scrapers are significantly faster on large lists of websites. At the same time, its interface is more expressive for the developer, as it is explicitly designed for scraping.
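
For a taste of what that looks like, here is a minimal, illustrative Scrapy spider (the spider name, selector, and output fields are assumptions for demonstration, not code from this book):

import scrapy

class BattleLinksSpider(scrapy.Spider):
    """Collect battle links from the Wikipedia list page."""
    name = "battle_links"
    start_urls = [
        "https://en.wikipedia.org/wiki/List_of_World_War_II_battles",
    ]

    def parse(self, response):
        # Scrapy schedules requests asynchronously, so many pages can be
        # in flight at once; here we only yield link items from one page.
        for link in response.css("div.mw-parser-output a::attr(href)").getall():
            if link.startswith("/wiki/"):
                yield {"url": response.urljoin(link)}

# Run from the command line, for example:
#   scrapy runspider battle_links_spider.py -o battles.json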

Depending...

Summary

In this chapter, we did the hard work of scraping data from HTML pages using the Beautiful Soup 4 library. With it, we were able to collect all the links from one page, preserving their hierarchy, and then retrieve the information for each of the collected links. This skill is invaluable, as it allows you to collect information from the internet, whether for research, business, or a personal hobby.

We also touched on Selenium, which emulates a full-blown browser; it can interact with the page and execute JavaScript, giving us access to content beyond static HTML.
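
For reference, a minimal, illustrative Selenium snippet (not covered by this chapter's requirements; it needs the selenium package and a matching browser driver installed separately) looks like this:

# Illustrative only: load a page in a real browser and read the rendered HTML.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://en.wikipedia.org/wiki/List_of_World_War_II_battles")
html_after_javascript = driver.page_source  # HTML after scripts have run
driver.quit()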

In the next chapter, we'll clean and use the data we collected, creating an interactive visualization of the war.

Questions

  1. What does the term web scraping mean in this context?
  2. What are the biggest differences between scraping and using an API? What are the challenges?
  3. What exactly does Beautiful Soup do? Can we scrape without it?
  4. Why did we use recursion here?
  5. Should we clean data during scraping?
  6. What is the right approach to dealing with missing data or broken links?

Further reading
