You're reading from Python Web Scraping Cookbook
Published in: Feb 2018 · Publisher: Packt · ISBN-13: 9781787285217 · Edition: 1st · Reading level: Beginner
Author: Michael Heydt

Michael Heydt is an independent consultant, programmer, educator, and trainer. He has a passion for learning and sharing his knowledge of new technologies. Michael has worked in multiple industry verticals, including media, finance, energy, and healthcare. Over the last decade, he has worked extensively with web, cloud, and mobile technologies and managed user experiences, interface design, and data visualization for major consulting firms and their clients. Michael's current company, Seamless Thingies, focuses on IoT development and connecting everything with everything. Michael is the author of numerous articles, papers, and books, such as D3.js By Example, Instant Lucene.NET, Learning Pandas, and Mastering Pandas for Finance, all by Packt. Michael is also a frequent speaker at .NET user groups and various mobile, cloud, and IoT conferences, and delivers webinars on advanced technologies.

Data Acquisition and Extraction

In this chapter, we will cover:

  • How to parse websites and navigate the DOM using BeautifulSoup
  • Searching the DOM with Beautiful Soup's find methods
  • Querying the DOM with XPath and lxml
  • Querying data with XPath and CSS Selectors
  • Using Scrapy selectors
  • Loading data in Unicode / UTF-8 format

Introduction

The key aspects of effective scraping are understanding how content and data are stored on web servers, identifying the data you want to retrieve, and understanding how the tools support this extraction. In this chapter, we will discuss website structures and the DOM, and introduce techniques to parse and query websites with lxml, XPath, and CSS selectors. We will also look at how to work with websites developed in other languages and with different encodings, such as Unicode.

Ultimately, understanding how to find and extract data within an HTML document comes down to understanding the structure of the HTML page, its representation in the DOM, the process of querying the DOM for specific elements, and how to specify which elements you want to retrieve based upon how the data is represented.

How to parse websites and navigate the DOM using BeautifulSoup

When the browser displays a web page it builds a model of the content of the page in a representation known as the document object model (DOM). The DOM is a hierarchical representation of the page's entire content, as well as structural information, style information, scripts, and links to other content.

It is critical to understand this structure to be able to effectively scrape data from web pages. We will look at an example web page, its DOM, and examine how to navigate the DOM with Beautiful Soup.

Getting ready

We will use a small website that is included in the www folder of the sample code. To follow along, start a web server from within the www folder...
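The recipe's sample page is not reproduced here, so the sketch below navigates a small hypothetical planets table instead (similar markup appears later in this chapter). It assumes only that Beautiful Soup 4 is installed:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page served from the www folder
html = """
<html><body>
  <table id="planetsTable">
    <tr class="planet" id="planet1"><td>Mercury</td></tr>
    <tr class="planet" id="planet2"><td>Venus</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Walk down the hierarchy one tag at a time: html -> body -> table
table = soup.html.body.table
print(table["id"])                   # planetsTable

# .children yields direct children (including whitespace text nodes,
# which have no tag name and are filtered out here)
rows = [c for c in table.children if c.name == "tr"]
print(rows[0].td.text)               # Mercury

# .parent walks back up the tree
print(rows[0].parent["id"])          # planetsTable
```

This style of navigation depends on knowing the page hierarchy; the next recipe shows searches that do not.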

Searching the DOM with Beautiful Soup's find methods

We can perform simple searches of the DOM using Beautiful Soup's find methods. These methods give us a much more flexible and powerful construct for finding elements that are not dependent upon the hierarchy of those elements. In this recipe we will examine several common uses of these functions to locate various elements in the DOM.

Getting ready

If you want to cut and paste the following into IPython, you can find the samples in 02/02_bs4_find.py.

How to do it...

We will start with a fresh IPython session...
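As a sketch of what these methods look like (the markup below is a hypothetical stand-in for the recipe's sample page):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <table id="planetsTable">
    <tr class="planet" id="planet1"><td>Mercury</td></tr>
    <tr class="planet" id="planet2"><td>Venus</td></tr>
    <tr class="planet" id="planet3"><td>Earth</td></tr>
  </table>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element, anywhere in the tree
earth = soup.find("tr", id="planet3")
print(earth.td.text)                      # Earth

# find_all() returns every match; class_ avoids the 'class' keyword
planets = soup.find_all("tr", class_="planet")
print(len(planets))                       # 3
```

Because find and find_all search the whole subtree, they work without knowing the exact hierarchy above the elements, unlike the tag-by-tag navigation of the previous recipe.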

Querying the DOM with XPath and lxml

XPath is a query language for selecting nodes from an XML document and is a must-learn query language for anyone performing web scraping. XPath offers a number of benefits to its users over other model-based tools:

  • Can easily navigate through the DOM tree
  • More sophisticated and powerful than other selectors like CSS selectors and regular expressions
  • It has a great set (200+) of built-in functions and is extensible with custom functions
  • It is widely supported by parsing libraries and scraping platforms

The XPath data model defines seven kinds of nodes (we have seen some of them previously):

  • root node (top level parent node)
  • element nodes (<a>..</a>)
  • attribute nodes (href="example.html")
  • text nodes ("this is a text")
  • comment nodes (<!-- a comment -->)
  • namespace nodes
  • processing instruction nodes

XPath expressions can...
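A short sketch of XPath queries with lxml, run against a hypothetical planets snippet in which the element, attribute, text, and comment node types from the list above all appear:

```python
from lxml import etree

# Hypothetical stand-in markup for the sample page
html = """
<html><body>
  <!-- a comment -->
  <table>
    <tr class="planet" id="planet1"><td>Mercury</td></tr>
    <tr class="planet" id="planet2"><td>Venus</td></tr>
  </table>
</body></html>
"""
tree = etree.HTML(html)

# Select attribute nodes: the id of every planet row
ids = tree.xpath("//tr[@class='planet']/@id")
print(ids)                               # ['planet1', 'planet2']

# Select a text node inside a specific element
name = tree.xpath("//tr[@id='planet2']/td/text()")[0]
print(name)                              # Venus

# One of XPath's built-in functions: count the matching rows
count = tree.xpath("count(//tr[@class='planet'])")
print(int(count))                        # 2
```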

Querying data with XPath and CSS selectors

CSS selectors are patterns used for selecting elements and are often used to define the elements that styles should be applied to. They can also be used with lxml to select nodes in the DOM. CSS selectors are commonly used as they are more compact than XPath and generally can be more reusable in code. Examples of common selectors which may be used are as follows:

What you are looking for                                    Example
All tags                                                    *
A specific tag (that is, tr)                                tr
A class name (that is, "planet")                            .planet
A tag with a class (that is, tr with class "planet")        tr.planet
A tag with an ID "planet3"                                  tr#planet3
A child tr of a table                                       table > tr
A descendant tr of a table                                  table tr
A tag with an attribute (that is, tr with id="planet4")     tr[id=planet4]
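These selector patterns can also be exercised with Beautiful Soup's select() and select_one(), which accept the same CSS selector syntax as lxml; the planets markup below is a hypothetical stand-in for the recipe's sample page:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <table>
    <tr class="planet" id="planet1"><td>Mercury</td></tr>
    <tr class="planet" id="planet2"><td>Venus</td></tr>
    <tr class="planet" id="planet3"><td>Earth</td></tr>
  </table>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("tr.planet")))           # 3      (tag with a class)
print(soup.select_one("tr#planet3").td.text)   # Earth  (tag with an ID)
print(len(soup.select("table tr")))            # 3      (descendant tr)
print(len(soup.select("table > tr")))          # 3      (direct child tr)
```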

Getting ready

...

Using Scrapy selectors

Scrapy is a Python web spider framework that is used to extract data from websites. It provides many powerful features for navigating entire websites, such as the ability to follow links. One feature it provides is the ability to find data within a document using the DOM and the now quite familiar XPath.

In this recipe we will load the list of current questions on StackOverflow, and then parse this using a scrapy selector. Using that selector, we will extract the text of each question.

Getting ready

The code for this recipe is in 02/05_scrapy_selectors.py.

How to do it...

...

Loading data in Unicode / UTF-8

A document's encoding tells an application how the characters in the document are represented as bytes in the file. Essentially, the encoding specifies how many bytes represent each character. In a standard ASCII document, every character fits in 7 bits and is stored as a single byte. HTML files were long assumed to use one byte per character, but with the globalization of the internet, this is not always the case: many HTML documents use 16-bit code units (such as UTF-16), or a variable number of bytes per character.

A particularly common form of HTML document encoding is UTF-8, which represents each character with one to four bytes. This is the encoding form that we will examine.
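A minimal stdlib-only sketch of the bytes-per-character point: ASCII characters occupy one byte each in UTF-8, while non-ASCII characters take two to four bytes:

```python
# Simulate the raw bytes received from a web server
raw = "Größe".encode("utf-8")

# 5 characters, but 7 bytes: 'ö' and 'ß' each take two bytes in UTF-8
print(len(raw))            # 7

# Decoding with the document's declared encoding yields a Python str
text = raw.decode("utf-8")
print(text)                # Größe
print(len(text))           # 5
```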

Getting ready

We will read a file named unicode.html from our local web server, located at http...

