Go Web Scraping Quick Start Guide


Product type Book
Published in Jan 2019
Publisher Packt
ISBN-13 9781789615708
Pages 132 pages
Edition 1st Edition
Author Vincent Smith

Parsing HTML

In the previous chapters, we have dealt with whole web pages, which is not really practical for most web scrapers. Although it is nice to have all of the content from a web page, most of the time, you will only need small pieces of information from each page. In order to extract this information, you must learn to parse the standard formats of the web, the most common of these being HTML.

This chapter will cover the following topics:

  • What is the HTML format?
  • Searching using the strings package
  • Searching using the regexp package
  • Searching using XPath queries
  • Searching using Cascading Style Sheets selectors

What is the HTML format?

HTML is the standard format used to provide web page content. An HTML page defines which elements a browser should draw, the content and style of those elements, and how the page should respond to interactions from the user. Looking back at our http://example.com/index.html response, you can see the following, which is what an HTML document looks like:

<!doctype html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<!-- The <style> section was removed for brevity -->
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is established to...

Searching using the strings package

The most basic way to search for content is to use the strings package from the Go standard library. The strings package allows you to perform various operations on string values, including searching for matches, counting occurrences, and splitting strings into slices. These utilities are enough for simple use cases, such as checking whether a page contains a known piece of text.
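As a rough illustration of these operations, the following sketch searches a small HTML fragment with strings.Contains, strings.Index, and strings.Split. The page constant and the helper names hasHeading and between are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// page is a hypothetical HTML fragment used only for illustration.
const page = `<h1>Example Domain</h1><p>First paragraph</p><p>Second paragraph</p>`

// hasHeading reports whether doc contains an <h1> opening tag.
func hasHeading(doc string) bool {
	return strings.Contains(doc, "<h1>")
}

// between returns the text between the first occurrence of openTag and
// the next occurrence of closeTag. The boolean is false if either is missing.
func between(doc, openTag, closeTag string) (string, bool) {
	start := strings.Index(doc, openTag)
	if start < 0 {
		return "", false
	}
	start += len(openTag)
	end := strings.Index(doc[start:], closeTag)
	if end < 0 {
		return "", false
	}
	return doc[start : start+end], true
}

func main() {
	fmt.Println(hasHeading(page)) // true

	if text, ok := between(page, "<h1>", "</h1>"); ok {
		fmt.Println(text) // Example Domain
	}

	// Splitting on "<p>" yields one piece before the first paragraph
	// plus one piece per paragraph.
	parts := strings.Split(page, "<p>")
	fmt.Println(len(parts) - 1) // 2
}
```

This approach only works when the markup is exactly as expected. A tag with attributes, such as an h1 tag carrying a class, would not match the literal "<h1>" used here.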

Example – Counting links

One quick and easy piece of information that we can extract using the strings package is the number of links contained in a web page. The strings package has a function called Count(), which returns the number of non-overlapping times a substring occurs in a string. As we have seen before, links are contained in...
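A minimal sketch of this idea follows. It assumes that every link starts with the literal text "<a " (an anchor tag followed by a space), which is a simplification chosen for this example and not necessarily the substring the book uses. The function name countLinks and the sample document are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// countLinks returns a rough count of anchor tags in doc by counting
// non-overlapping occurrences of "<a " after lowercasing the input.
// Assumption: each link opens with "<a" followed by a space.
func countLinks(doc string) int {
	return strings.Count(strings.ToLower(doc), "<a ")
}

func main() {
	// Hypothetical document: two links and one <abbr> tag that must not be counted.
	doc := `<p><a href="/one">One</a> and <A HREF="/two">Two</A></p><abbr>no link</abbr>`
	fmt.Println(countLinks(doc)) // 2
}
```

The count is only an estimate: an anchor tag whose name is followed by a newline or tab instead of a space would be missed, and the same text inside a comment or script would be counted.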

Searching using the regexp package

The regexp package in the Go standard library provides a deeper level of search by using regular expressions. Regular expressions define a syntax that allows you to search for strings in more complex terms, as well as to retrieve strings from a document. By using capture groups in regular expressions, you can extract data matching a query from the web page. Here are a few useful tasks that the regexp package can help you achieve.
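To show how a capture group works, here is a small sketch that pulls the page title out of a document. The pattern, the variable titleRe, and the function extractTitle are assumptions made for this example and are not taken from the book.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// titleRe captures the text between <title> and </title>.
// (?i) makes the match case-insensitive, (?s) lets . match newlines,
// and .*? keeps the capture as short as possible.
var titleRe = regexp.MustCompile(`(?is)<title[^>]*>(.*?)</title>`)

// extractTitle returns the contents of the first <title> element, if any.
func extractTitle(doc string) (string, bool) {
	m := titleRe.FindStringSubmatch(doc)
	if m == nil {
		return "", false
	}
	// m[0] is the whole match; m[1] is the first capture group.
	return strings.TrimSpace(m[1]), true
}

func main() {
	doc := `<html><head><title>Example Domain</title></head></html>`
	if title, ok := extractTitle(doc); ok {
		fmt.Println(title) // Example Domain
	}
}
```

FindStringSubmatch returns nil when nothing matches, so checking the result before indexing into it avoids a panic on pages without a title.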

Example – Finding links

In the previous section, we used the strings package to count the number of links on a page. By using the regexp package, we can take this example further and retrieve the actual links with the following regular expression:

 <a.*href\s*=\s*[...

Searching using XPath queries

In the previous examples for parsing HTML documents, we treated HTML simply as searchable text, where you can discover information by looking for specific strings. Fortunately, HTML documents actually have a structure. You can see that each set of tags can be viewed as some object, called a node, which can, in turn, contain more nodes. This creates a hierarchy of root, parent, and child nodes, providing a structured document. In particular, HTML documents are very similar to XML documents, although they are not fully XML-compliant. Because of this XML-like structure, we can search for content in the pages using XPath queries.
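The node hierarchy described above can be made visible with a short sketch. Note that this example does not run an XPath query: it walks the tree with the standard library encoding/xml package and prints the path of element names leading to each link. It assumes the input is a well-formed, XML-compliant fragment, which real HTML pages often are not. The function name linkTargets and the sample document are hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// linkTargets walks the elements of a well-formed document and, for every
// <a> element with an href attribute, records the path of element names
// from the root down to that element, followed by the href value.
func linkTargets(doc string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var path []string // names of the currently open elements, root first
	var out []string
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			// Entering a child node: extend the path.
			path = append(path, t.Name.Local)
			if t.Name.Local == "a" {
				for _, attr := range t.Attr {
					if attr.Name.Local == "href" {
						out = append(out, "/"+strings.Join(path, "/")+" -> "+attr.Value)
					}
				}
			}
		case xml.EndElement:
			// Leaving a node: return to its parent.
			path = path[:len(path)-1]
		}
	}
}

func main() {
	// Hypothetical, well-formed fragment used only for illustration.
	doc := `<html><body><div><p><a href="http://example.com/more">More</a></p></div></body></html>`
	links, err := linkTargets(doc)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, l := range links {
		fmt.Println(l) // /html/body/div/p/a -> http://example.com/more
	}
}
```

The printed path has the same root-to-child shape that a structured query follows when it descends from the document root to a matching node.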

XPath queries define a way to traverse the hierarchy of nodes in an XML document, and return matching elements. In our previous examples, where we were looking for <a> tags in order to count and retrieve links, we needed...

Searching using Cascading Style Sheets selectors

You can see how using a structured query language makes searching for, and retrieving, information much easier than basic string searches. However, XPath was designed for generic XML documents, not HTML. There is another structured query language that is made specifically for HTML. Cascading Style Sheets (CSS) were created to provide a way to add stylistic elements to HTML pages. In a CSS file, you define a path to one or more elements, along with rules that describe their appearance. The definitions for the path to the element are called CSS selectors and are written specifically for HTML documents.

CSS selectors understand common attributes that we could use in searching HTML documents. In the previous XPath examples, we often used a query such as div[@class="some-class"] in order to search for elements with the class...

Summary

You can see that there are various ways of extracting data from HTML pages using different tools. Basic string searches and regex searches can collect information using very simple techniques, but there are cases where more structured query languages are needed. XPath provides great searching capabilities by assuming the document is XML-formatted and can cover generic searches. CSS selectors are the simplest way to search for and extract data from HTML documents and provide many helpful features that are HTML-specific.

In Chapter 5, Web Scraping Navigation, we will look at the best ways to crawl the internet efficiently and safely.
