You're reading from Python Web Scraping - Second Edition

Published in: May 2017
Publisher: Packt
Reading level: Intermediate
ISBN-13: 9781786462589
Author: Katharine Jarmul

Katharine Jarmul is a data scientist and Pythonista based in Berlin, Germany. She runs a data science consulting company, Kjamistan, that provides services such as data extraction, acquisition, and modelling for small and large companies. She has been writing Python since 2008 and scraping the web with Python since 2010, and has worked at both small and large start-ups that use web scraping for data analysis and machine learning. When she's not scraping the web, you can follow her thoughts and activities via Twitter (@kjam).

Scraping the Data

In the previous chapter, we built a crawler that follows links to download the web pages we want. This is interesting but not useful: the crawler downloads a web page and then discards the result. Now we need to make the crawler achieve something by extracting data from each web page, a process known as scraping.

We will first cover browser tools for examining a web page, which you may already be familiar with if you have a web development background. Then we will walk through three approaches to extracting data from a web page: regular expressions, Beautiful Soup, and lxml. Finally, the chapter concludes with a comparison of these three scraping alternatives.

In this chapter, we will cover the following topics:

  • Analyzing a web page
  • Approaches to scrape a web page
  • Using the console
  • XPath selectors
  • Scraping results
...

Analyzing a web page

To understand how a web page is structured, we can examine its source code. In most web browsers, the source code of a web page can be viewed by right-clicking on the page and selecting the View page source option:

For our example website, the data we are interested in is found on the country pages. Take a look at the page source (via the browser menu or the right-click context menu). In the source of the example page for the United Kingdom (http://example.webscraping.com/view/United-Kingdom-239), you will find a table containing the country data (you can use your browser's search to locate it in the page source):

<table> 
<tr id="places_national_flag__row"><td class="w2p_fl"><label for="places_national_flag" id="places_national_flag__label">National Flag:</label></td>
<td class="w2p_fw"><img src="/places...

Three approaches to scrape a web page

Now that we understand the structure of this web page, we will investigate three different approaches to scraping its data: first with regular expressions, then with the popular BeautifulSoup module, and finally with the powerful lxml module.

Regular expressions

If you are unfamiliar with regular expressions or need a reminder, there is a thorough overview available at https://docs.python.org/3/howto/regex.html. Even if you use regular expressions (or regex) with another programming language, I recommend stepping through it for a refresher on regex with Python.

Because each chapter might build or use parts of previous chapters, we recommend setting up your file structure similar to that in the book repository. All code can then...
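
The section's worked example is elided in this preview, but here is a minimal sketch of the technique. The html string below is an invented stand-in fragment mirroring the country table shown earlier, not the full page; a non-greedy regular expression pulls each value out of its table cell:

import re

# a stand-in fragment with the same structure as the real country table
html = ('<tr id="places_area__row"><td class="w2p_fl">'
        '<label>Area: </label></td>'
        '<td class="w2p_fw">244,820 square kilometres</td></tr>')
# every country value sits in a <td class="w2p_fw"> cell, so a non-greedy
# group on that cell captures the field values in document order
print(re.findall(r'<td class="w2p_fw">(.*?)</td>', html))
# ['244,820 square kilometres']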

CSS selectors and your Browser Console

Like the notation we used earlier with cssselect, CSS selectors are patterns used to select HTML elements. Here are some common selectors you should know:

  • Select any tag: *
  • Select by tag <a>: a
  • Select by class of "link": .link
  • Select by tag <a> with class "link": a.link
  • Select by tag <a> with ID "home": a#home
  • Select by child <span> of tag <a>: a > span
  • Select by descendant <span> of tag <a>: a span
  • Select by tag <a> with attribute title of "Home": a[title=Home]

The cssselect library implements most CSS3 selectors, and details on unsupported features (primarily browser interactions) are available at https://cssselect.readthedocs.io/en/latest/#supported-selectors.
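
To see these patterns in action, here is a small hedged sketch using cssselect through lxml. The HTML fragment is an invented stand-in, not the example site:

from lxml.html import fromstring

# an illustrative stand-in fragment
tree = fromstring('<div><a href="/places" class="link" title="Home">Home</a>'
                  '<a href="/about" id="about">About</a></div>')
print(tree.cssselect('a.link')[0].text)                # tag <a> with class "link"
print(tree.cssselect('a#about')[0].text)               # tag <a> with ID "about"
print(tree.cssselect('a[title=Home]')[0].get('href'))  # attribute title of "Home"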

The CSS3 specification was produced by the W3C and is available for viewing at http://www.w3.org/TR/2011...

XPath Selectors

There are times when using CSS selectors will not work, especially with very broken HTML or improperly formatted elements. Despite the best efforts of libraries like BeautifulSoup and lxml to properly parse and clean up the code, it will not always work; in these cases, XPath can help you build very specific selectors based on the hierarchical relationships of elements on the page.

XPath is a way of describing relationships as a hierarchy in XML documents. Because HTML documents are also structured as a hierarchy of elements, we can use XPath to navigate and select elements from an HTML document as well.

To read more about XPath, check out the Mozilla developer documentation: https://developer.mozilla.org/en-US/docs/Web/XPath.

XPath follows some basic syntax rules and has some similarities with CSS selectors. Take a look at the following chart for a quick reference between the two.

Selector description...
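
The chart itself is truncated in this preview, but a short sketch can illustrate the correspondence: the CSS selector and the XPath expression below match the same cell. The fragment is a stand-in for the country table on the example page:

from lxml.html import fromstring

# a stand-in fragment mirroring the country table
tree = fromstring('<table><tr id="places_area__row">'
                  '<td class="w2p_fw">244,820 square kilometres</td></tr></table>')
# a CSS selector and the equivalent XPath expression select the same cell
css_cell = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
xpath_cell = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')[0]
print(css_cell.text_content() == xpath_cell.text_content())  # True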

LXML and Family Trees

lxml also has the ability to traverse family trees within the HTML page. What is a family tree? When you used your browser's developer tools to investigate the elements on the page and were able to expand or collapse them, you were observing the family relationships in the HTML. Every element on a web page can have parents, siblings, and children. These relationships can help us traverse the page more easily.

For example, if I want to find all the elements at the same node depth on the page, I would look for their siblings. Or maybe I want every element that is a child of a particular element on the page. lxml allows us to express many of these relationships with simple Python code.

As an example, let's investigate all children of the table element on the example page:

>>> table = tree.xpath('//table')[0]
>>> table.getchildren()
[<Element tr at...
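
Beyond getchildren(), a few sibling- and parent-navigation calls round out the family tree. This is a short sketch continuing the same session, so it assumes the table element from the snippet above:

>>> first_row = table.getchildren()[0]
>>> first_row.getparent().tag   # the row's parent element is the table itself
'table'
>>> first_row.getnext()         # the next sibling row, or None for the last row
<Element tr at ...>
>>> [child.tag for child in first_row]  # iterating an element yields its children
['td', 'td']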

Comparing performance

To evaluate the trade-offs between the three scraping approaches described in the section Three approaches to scrape a web page, we can compare their relative efficiency. Typically, a scraper extracts multiple fields from a web page, so for a more realistic comparison we will implement extended versions of each scraper that extract all the available data from a country's web page. To get started, we need to return to our browser to check the format of the other country features, as shown here:

By using our browser's inspect capabilities, we can see that each table row has an ID starting with places_ and ending with __row, and that the country data is contained within these rows in the same format as the area example. Here are implementations that use this information to extract all of the available country data:

FIELDS = ('area', 'population',...
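
Those implementations are elided here, but as a hedged sketch of the pattern they share, an lxml/cssselect version might look like the following. The two-field FIELDS tuple is a stand-in for the full tuple above:

from lxml.html import fromstring

FIELDS = ('area', 'population')  # stand-in: the real tuple lists every field

def lxml_scraper(html):
    # each value sits in the w2p_fw cell of the row with ID places_<field>__row
    tree = fromstring(html)
    results = {}
    for field in FIELDS:
        results[field] = tree.cssselect(
            'table > tr#places_%s__row > td.w2p_fw' % field)[0].text_content()
    return results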

Scraping results

Now that we have complete implementations of each scraper, we will test their relative performance with the following snippet. The imports in the code expect your directory structure to match the book's repository, so adjust as necessary:

import time
import re
from chp2.all_scrapers import (re_scraper, bs_scraper,
                               lxml_scraper, lxml_xpath_scraper)
from chp1.advanced_link_crawler import download

NUM_ITERATIONS = 1000  # number of times to test each scraper
html = download('http://example.webscraping.com/places/view/United-Kingdom-239')

scrapers = [
    ('Regular expressions', re_scraper),
    ('BeautifulSoup', bs_scraper),
    ('Lxml', lxml_scraper),
    ('Xpath', lxml_xpath_scraper)]

for name, scraper in scrapers:
    # record start time of scrape
    start = time.time()
    for i in range(NUM_ITERATIONS):
        if scraper == re_scraper:
            re...

Summary

In this chapter, we walked through a variety of ways to scrape data from a web page. Regular expressions can be useful for a one-off scrape or to avoid the overhead of parsing the entire web page, and BeautifulSoup provides a high-level interface while avoiding any difficult dependencies. However, in general, lxml will be the best choice because of its speed and extensive functionality, so we will use it in future examples.

We also learned how to inspect HTML pages using browser tools and the console, and how to define CSS selectors and XPath selectors to match and extract content from the downloaded pages.

In the next chapter, we will introduce caching, which allows us to save web pages so they need to be downloaded only the first time a crawler is run.
