You're reading from Python Web Scraping Cookbook
Published in: Feb 2018 · Publisher: Packt · ISBN-13: 9781787285217 · Edition: 1st · Reading level: Beginner
Author: Michael Heydt

Michael Heydt is an independent consultant, programmer, educator, and trainer. He has a passion for learning and sharing his knowledge of new technologies. Michael has worked in multiple industry verticals, including media, finance, energy, and healthcare. Over the last decade, he has worked extensively with web, cloud, and mobile technologies and managed user experiences, interface design, and data visualization for major consulting firms and their clients. Michael's current company, Seamless Thingies, focuses on IoT development and connecting everything with everything. Michael is the author of numerous articles, papers, and books, such as D3.js By Example, Instant Lucene.NET, Learning Pandas, and Mastering Pandas for Finance, all by Packt. Michael is also a frequent speaker at .NET user groups and various mobile, cloud, and IoT conferences, and delivers webinars on advanced technologies.

Data Acquisition and Extraction

In this chapter, we will cover:

  • How to parse websites and navigate the DOM using BeautifulSoup
  • Searching the DOM with Beautiful Soup's find methods
  • Querying the DOM with XPath and lxml
  • Querying data with XPath and CSS Selectors
  • Using Scrapy selectors
  • Loading data in Unicode / UTF-8 format

Introduction

The key aspects of effective scraping are understanding how content and data are stored on web servers, identifying the data you want to retrieve, and understanding how the tools support this extraction. In this chapter, we will discuss website structures and the DOM, and introduce techniques to parse and query websites with lxml, XPath, and CSS selectors. We will also look at how to work with websites developed in other languages and with different encodings, such as Unicode.

Ultimately, understanding how to find and extract data within an HTML document comes down to understanding the structure of the HTML page, its representation in the DOM, the process of querying the DOM for specific elements, and how to specify which elements you want to retrieve based upon how the data is represented.

How to parse websites and navigate the DOM using BeautifulSoup

When the browser displays a web page it builds a model of the content of the page in a representation known as the document object model (DOM). The DOM is a hierarchical representation of the page's entire content, as well as structural information, style information, scripts, and links to other content.

It is critical to understand this structure to be able to effectively scrape data from web pages. We will look at an example web page, its DOM, and examine how to navigate the DOM with Beautiful Soup.

Getting ready

We will use a small website that is included in the www folder of the sample code. To follow along, start a web server from within the www folder...
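The recipe's sample page is not reproduced here, so the sketch below navigates a small hypothetical planets table instead (similar markup appears later in this chapter). It assumes only that Beautiful Soup 4 is installed:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page served from the www folder
html = """
<html><body>
  <table id="planetsTable">
    <tr class="planet" id="planet1"><td>Mercury</td></tr>
    <tr class="planet" id="planet2"><td>Venus</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Walk down the hierarchy one tag at a time: html -> body -> table
table = soup.html.body.table
print(table["id"])                   # planetsTable

# .children yields direct children (including whitespace text nodes,
# which have no tag name and are filtered out here)
rows = [c for c in table.children if c.name == "tr"]
print(rows[0].td.text)               # Mercury

# .parent walks back up the tree
print(rows[0].parent["id"])          # planetsTable
```

This style of navigation depends on knowing the page hierarchy; the next recipe shows searches that do not.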

Searching the DOM with Beautiful Soup's find methods

We can perform simple searches of the DOM using Beautiful Soup's find methods. These methods give us a much more flexible and powerful construct for finding elements that are not dependent upon the hierarchy of those elements. In this recipe we will examine several common uses of these functions to locate various elements in the DOM.

Getting ready

If you want to cut and paste the following into IPython, you can find the samples in 02/02_bs4_find.py.

How to do it...

We will start with a fresh IPython session...
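As a sketch of what these methods look like (the markup below is a hypothetical stand-in for the recipe's sample page):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <table id="planetsTable">
    <tr class="planet" id="planet1"><td>Mercury</td></tr>
    <tr class="planet" id="planet2"><td>Venus</td></tr>
    <tr class="planet" id="planet3"><td>Earth</td></tr>
  </table>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element, anywhere in the tree
earth = soup.find("tr", id="planet3")
print(earth.td.text)                      # Earth

# find_all() returns every match; class_ avoids the 'class' keyword
planets = soup.find_all("tr", class_="planet")
print(len(planets))                       # 3
```

Because find and find_all search the whole subtree, they work without knowing the exact hierarchy above the elements, unlike the tag-by-tag navigation of the previous recipe.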

Querying the DOM with XPath and lxml

XPath is a query language for selecting nodes from an XML document and is a must-learn query language for anyone performing web scraping. XPath offers a number of benefits to its users over other model-based tools:

  • Can easily navigate through the DOM tree
  • More sophisticated and powerful than other selectors like CSS selectors and regular expressions
  • It has a great set (200+) of built-in functions and is extensible with custom functions
  • It is widely supported by parsing libraries and scraping platforms

The XPath data model defines seven kinds of nodes (we have seen some of them previously):

  • root node (top level parent node)
  • element nodes (<a>..</a>)
  • attribute nodes (href="example.html")
  • text nodes ("this is a text")
  • comment nodes (<!-- a comment -->)
  • namespace nodes
  • processing instruction nodes

XPath expressions can...
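A short sketch of XPath queries with lxml, run against a hypothetical planets snippet in which the element, attribute, text, and comment node types from the list above all appear:

```python
from lxml import etree

# Hypothetical stand-in markup for the sample page
html = """
<html><body>
  <!-- a comment -->
  <table>
    <tr class="planet" id="planet1"><td>Mercury</td></tr>
    <tr class="planet" id="planet2"><td>Venus</td></tr>
  </table>
</body></html>
"""
tree = etree.HTML(html)

# Select attribute nodes: the id of every planet row
ids = tree.xpath("//tr[@class='planet']/@id")
print(ids)                               # ['planet1', 'planet2']

# Select a text node inside a specific element
name = tree.xpath("//tr[@id='planet2']/td/text()")[0]
print(name)                              # Venus

# One of XPath's built-in functions: count the matching rows
count = tree.xpath("count(//tr[@class='planet'])")
print(int(count))                        # 2
```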

Querying data with XPath and CSS selectors

CSS selectors are patterns used for selecting elements and are often used to define the elements that styles should be applied to. They can also be used with lxml to select nodes in the DOM. CSS selectors are commonly used as they are more compact than XPath and generally can be more reusable in code. Examples of common selectors which may be used are as follows:

What you are looking for                                    Example
All tags                                                    *
A specific tag (that is, tr)                                tr
A class name (that is, "planet")                            .planet
A tag with a class (that is, tr with class "planet")        tr.planet
A tag with an ID "planet3"                                  tr#planet3
A child tr of a table                                       table > tr
A descendant tr of a table                                  table tr
A tag with an attribute (that is, tr with id="planet4")     tr[id=planet4]
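These selector patterns can also be exercised with Beautiful Soup's select() and select_one(), which accept the same CSS selector syntax as lxml; the planets markup below is a hypothetical stand-in for the recipe's sample page:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <table>
    <tr class="planet" id="planet1"><td>Mercury</td></tr>
    <tr class="planet" id="planet2"><td>Venus</td></tr>
    <tr class="planet" id="planet3"><td>Earth</td></tr>
  </table>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("tr.planet")))           # 3      (tag with a class)
print(soup.select_one("tr#planet3").td.text)   # Earth  (tag with an ID)
print(len(soup.select("table tr")))            # 3      (descendant tr)
print(len(soup.select("table > tr")))          # 3      (direct child tr)
```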

Getting ready

...

Using Scrapy selectors

Scrapy is a Python web spider framework that is used to extract data from websites. It provides many powerful features for navigating entire websites, such as the ability to follow links. One feature it provides is the ability to find data within a document using the DOM and the now quite familiar XPath.

In this recipe we will load the list of current questions on StackOverflow, and then parse this using a scrapy selector. Using that selector, we will extract the text of each question.

Getting ready

The code for this recipe is in 02/05_scrapy_selectors.py.

How to do it...

...

Loading data in Unicode / UTF-8

A document's encoding tells an application how the characters in the document are represented as bytes in the file. Essentially, the encoding specifies how many bytes represent each character. In a standard ASCII document, every character fits in 7 bits and is stored as a single byte. HTML files were long assumed to use one byte per character, but with the globalization of the internet, this is not always the case: many HTML documents use 16-bit code units (such as UTF-16), or a variable number of bytes per character.

A particularly common form of HTML document encoding is UTF-8, which represents each character with one to four bytes. This is the encoding form that we will examine.
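A minimal stdlib-only sketch of the bytes-per-character point: ASCII characters occupy one byte each in UTF-8, while non-ASCII characters take two to four bytes:

```python
# Simulate the raw bytes received from a web server
raw = "Größe".encode("utf-8")

# 5 characters, but 7 bytes: 'ö' and 'ß' each take two bytes in UTF-8
print(len(raw))            # 7

# Decoding with the document's declared encoding yields a Python str
text = raw.decode("utf-8")
print(text)                # Größe
print(len(text))           # 5
```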

Getting ready

We will read a file named unicode.html from our local web server, located at http...

