Parsing HTML

In the previous chapters, we worked with whole web pages, which is rarely practical for a web scraper. Although it is nice to have all of the content from a web page, most of the time you will need only small pieces of information from each page. To extract this information, you must learn to parse the standard formats of the web, the most common of which is HTML.

This chapter will cover the following topics:

What is the HTML format
Searching using the strings package
Searching using the regexp package
Searching using XPath queries
Searching using Cascading Style Sheets selectors

What is the HTML format?

HTML is the standard format used to provide web page content. An HTML page defines which elements a browser should draw, the content and style of those elements, and how the page should respond to interactions from the user. Looking back at our http://example.com/index.html response, you can see what an HTML document looks like:

<!doctype html>
<html>
<head>
  <title>Example Domain</title>
  <meta charset="utf-8" />
  <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1" />
  <!-- The <style> section was removed for brevity -->
</head>
<body>
  <div>
    <h1>Example Domain</h1>
    <p>This domain is established to...

Searching using the strings package

The most basic way to search for content is to use the strings package from the Go standard library. The strings package allows you to perform various operations on string values, including searching for matches, counting occurrences, and splitting strings into slices. These functions are enough for simple use cases, such as checking whether a page contains a given piece of text.
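The three kinds of operation mentioned here can be sketched as follows. This is only an illustration under stated assumptions: the page is a short inline string rather than a fetched response, and titleText is a hypothetical helper that does not handle attributes or uppercase tags.

```go
package main

import (
	"fmt"
	"strings"
)

// titleText returns the text between the first <title> and </title> tags,
// or an empty string if either tag is missing. Hypothetical helper for
// illustration only.
func titleText(page string) string {
	const openTag, closeTag = "<title>", "</title>"
	start := strings.Index(page, openTag)
	if start == -1 {
		return ""
	}
	start += len(openTag)
	end := strings.Index(page[start:], closeTag)
	if end == -1 {
		return ""
	}
	return page[start : start+end]
}

func main() {
	// Sample markup, assumed for the example.
	page := "<html><head><title>Example Domain</title></head><body><h1>Example Domain</h1></body></html>"

	// Searching for a match.
	fmt.Println(strings.Contains(page, "<h1>"))

	// Counting occurrences of a substring.
	fmt.Println(strings.Count(page, "Example Domain"))

	// Splitting a string into a slice of substrings.
	fmt.Println(strings.Split("a,b,c", ","))

	// Combining Index calls to pull out a piece of text.
	fmt.Println(titleText(page))
}
```

Because these functions see only raw text, they work best when the markup you are looking for is short and predictable.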

Example – Counting links

One quick and easy piece of information that we can extract using the strings package is the number of links contained in a web page. The strings package has a function called Count(), which returns the number of non-overlapping times a substring occurs in a string. As we have seen before, links are contained in...
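A minimal sketch of such a count, assuming the page is already held in a string and that every link begins with the literal text "<a " (the countLinks helper and the sample markup are hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// countLinks returns how many times the opening of an anchor tag, "<a ",
// occurs in page. It is a rough estimate: it misses uppercase tags and
// counts matches inside comments or scripts.
func countLinks(page string) int {
	return strings.Count(page, "<a ")
}

func main() {
	// Sample markup, assumed for the example.
	page := `<p><a href="/one">One</a> and <a href="/two">Two</a></p>`
	fmt.Println(countLinks(page)) // prints 2
}
```

Searching for "<a " with the trailing space avoids counting other tags that merely start with the letter a, such as <abbr>.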

Searching using the regexp package

The regexp package in the Go standard library provides a deeper level of search by using regular expressions. Regular expressions define a pattern syntax that lets you search for strings in more complex terms, as well as retrieve strings from a document. By using capture groups in regular expressions, you can extract data matching a query from the web page. Here are a few useful tasks that the regexp package can help you achieve.

Example – Finding links

In the previous section, we used the strings package to count the number of links on a page. By using the regexp package, we can take this example further and retrieve the actual links with the following regular expression:

 <a.*href\s*=\s*[...
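To show how capture groups return the link targets, here is a minimal sketch. It deliberately uses a simplified pattern of its own rather than the expression above, and it assumes that href values are wrapped in double quotes; linkPattern, findLinks, and the sample markup are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// linkPattern is a simplified, assumed pattern: it captures the value of a
// double-quoted href attribute inside an <a> tag.
var linkPattern = regexp.MustCompile(`<a\s[^>]*href\s*=\s*"([^"]*)"`)

// findLinks returns the captured href values in the order they appear.
func findLinks(page string) []string {
	var links []string
	for _, m := range linkPattern.FindAllStringSubmatch(page, -1) {
		// m[0] is the whole match; m[1] is the first capture group.
		links = append(links, m[1])
	}
	return links
}

func main() {
	// Sample markup, assumed for the example.
	page := `<a href="/one">One</a> and <a class="x" href = "/two">Two</a>`
	for _, link := range findLinks(page) {
		fmt.Println(link)
	}
}
```

FindAllStringSubmatch returns one slice per match, so the same loop works whether the page contains one link or hundreds.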

Searching using XPath queries

In the previous examples for parsing HTML documents, we treated HTML simply as searchable text, where you can discover information by looking for specific strings. Fortunately, HTML documents actually have a structure. You can see that each set of tags can be viewed as some object, called a node, which can, in turn, contain more nodes. This creates a hierarchy of root, parent, and child nodes, providing a structured document. In particular, HTML documents are very similar to XML documents, although they are not fully XML-compliant. Because of this XML-like structure, we can search for content in the pages using XPath queries.
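The node hierarchy can be made visible with a short sketch. It assumes a small, well-formed fragment so that the standard encoding/xml decoder can read it; real HTML pages are often not valid XML and may fail to parse this way. The outline helper and the sample document are hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// outline returns one line per element in doc, indented by the element's
// depth in the node hierarchy. doc is assumed to be well-formed XML.
func outline(doc string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var lines []string
	depth := 0
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return lines, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			// A start tag opens a new node one level below its parent.
			lines = append(lines, strings.Repeat("  ", depth)+t.Name.Local)
			depth++
		case xml.EndElement:
			// An end tag returns to the parent's level.
			depth--
		}
	}
}

func main() {
	// Sample document, assumed for the example.
	doc := `<html><head><title>Example Domain</title></head><body><div><h1>Example Domain</h1></div></body></html>`
	lines, err := outline(doc)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, line := range lines {
		fmt.Println(line)
	}
}
```

Each level of indentation in the output corresponds to one step from a parent node to a child node, with html as the root.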

XPath queries define a way to traverse the hierarchy of nodes in an XML document, and return matching elements. In our previous examples, where we were looking for <a> tags in order to count and retrieve links, we needed...

Searching using Cascading Style Sheets selectors

You can see how using a structured query language makes searching for, and retrieving, information much easier than basic string searches. However, XPath was designed for generic XML documents, not HTML. There is another structured query language that is made specifically for HTML. Cascading Style Sheets (CSS) were created to provide a way to add stylistic elements to HTML pages. In a CSS file, you define a path to one or more elements, followed by the rules that describe their appearance. The definitions for the path to the element are called CSS selectors and are written specifically for HTML documents.

CSS selectors understand common attributes that we could use in searching HTML documents. In the previous XPath examples, we often used a query such as div[@class="some-class"] in order to search for elements with the class...

Summary

You can see that there are various ways of extracting data from HTML pages using different tools. Basic string searches and regex searches can collect information using very simple techniques, but there are cases where more structured query languages are needed. XPath provides powerful searching capabilities by treating the document as XML, and it can cover generic searches. CSS selectors are the simplest way to search for and extract data from HTML documents, and they provide many helpful features that are HTML-specific.

In Chapter 5, Web Scraping Navigation, we will look at the best ways to crawl the internet efficiently and safely.