You're reading from Go Web Scraping Quick Start Guide

Product type: Book
Published in: Jan 2019
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789615708
Edition: 1st Edition
Author (1)
Vincent Smith

Vincent Smith has been a software engineer for 10 years, having worked in various fields from health and IT to machine learning and large-scale web scrapers. He has worked for both large Fortune 500 companies and start-ups, and has sharpened his skills with the best of both worlds. While obtaining a degree in electrical engineering, he learned the foundations of writing good code through his Java courses. These basics helped spur his move into software development early in his professional career, in order to provide support for his team. He fell in love with the process of teaching computers how to behave, which set him on the path he still walks today.

Scraping with Concurrency

As you add more and more target websites to your scraping requirements, you will eventually hit a point where you wish you could make more calls, faster. In a program that handles one request at a time, the crawl delay for one site holds up the whole scraper, adding unnecessary waiting time before the other sites are processed. Do you see the problem in the following diagram?

If these two sites could be run in parallel, there would not be any interference. Maybe the time to access and parse a page is longer than the crawl delay for this website, and launching a second request before the processing of the first response completes could save you time as well. Look how the situation is improved in the following diagram:

In any of these cases, you will need to introduce concurrency into your web scraper.

In this chapter, we will cover the following topics:

  • What is concurrency
  • Concurrency pitfalls...

What is concurrency

Instructions in a program are run by a central processing unit (CPU). A CPU can run multiple threads, each of which processes the instructions for a separate task. This is achieved by switching between the tasks and executing their instructions in an alternating fashion, like pulling together two sides of a zipper. This overlapping execution of tasks is called concurrency. For the sake of simplicity, we will describe it as performing multiple tasks at the same time. The following diagram shows how it may appear:

Concurrency should not be confused with parallelism, where two things or instructions can literally be executed at the same time.

By introducing a concurrent architecture to your web scraper, you will be able to make multiple web requests to different sites without waiting for one site to respond. In this way, a slow site or bad connection to...

Concurrency pitfalls

The source of most issues with concurrency is figuring out how to share information safely, and provide access to that information, between multiple threads. The simplest solution would seem to be to have an object that both threads can access and modify in order to communicate with each other. This seemingly innocent strategy is easier said than done. Let's look at this example, where two threads are sharing the same stack of web pages to scrape. They will need to know which web pages have been completed, and which web pages the other thread is currently working on.

We will use a simple map for this example, as shown in the following code:

siteStatus := map[string]string{
    "http://example.com/page1.html": "READY",
    "http://example.com/page2.html": "READY",
    "http://example.com/page3...

The Go concurrency model

As you have seen, many of the problems with concurrent programs stem from sharing memory resources between multiple threads. This shared memory is used to communicate state and can be very fragile, with great care needed to ensure that everything stays up and running. In Go, concurrency is approached with the mantra:

Do not communicate by sharing memory; instead, share memory by communicating.

When you use mutexes and locks around a common object, you are communicating by sharing memory. Multiple threads look to the same memory location to alert and to provide information for the other threads to use. Instead of doing this, Go provides tools to help share memory by communicating.
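One way to picture "sharing memory by communicating" is to let a single goroutine own the status map while workers report their progress over a channel. The sketch below is only an illustration under assumptions: the statusUpdate type, the scrapeAll function, and the "DONE" status are invented for this example, and the workers do no real scraping.

```go
package main

import "fmt"

// statusUpdate is a hypothetical message: a worker reports the new state
// of one page.
type statusUpdate struct {
	url    string
	status string
}

// scrapeAll hands each URL to a pool of worker goroutines and collects their
// status updates over a channel. Only this function reads or writes the map,
// so no mutex is needed. workers must be at least 1.
func scrapeAll(urls []string, workers int) map[string]string {
	siteStatus := make(map[string]string, len(urls))
	for _, u := range urls {
		siteStatus[u] = "READY"
	}

	jobs := make(chan string)
	updates := make(chan statusUpdate)

	// Workers receive URLs and send back a message instead of touching
	// the map themselves.
	for w := 0; w < workers; w++ {
		go func() {
			for u := range jobs {
				// A real scraper would request and parse the page here.
				updates <- statusUpdate{url: u, status: "DONE"}
			}
		}()
	}

	// Feed the jobs from a separate goroutine so this function is free
	// to receive updates at the same time.
	go func() {
		for _, u := range urls {
			jobs <- u
		}
		close(jobs)
	}()

	// Exactly one update arrives per URL.
	for range urls {
		up := <-updates
		siteStatus[up.url] = up.status
	}
	return siteStatus
}

func main() {
	urls := []string{
		"http://example.com/page1.html",
		"http://example.com/page2.html",
	}
	fmt.Println(scrapeAll(urls, 2))
}
```

Because the map never leaves the goroutine that created it, there is nothing to lock; the channels carry the state changes instead.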

Goroutines

Up until...

sync package helpers

Goroutines and channels, being the core constructs of concurrent programming in Go, will provide most of the utility that you will need. However, there are many helpful objects that the Go standard library provides that are also useful to know. We have already seen how sync.Mutex and sync.RWMutex work, but let's take a look at some of the other objects offered.

Conditions

Now that you are able to launch scraper tasks into multiple threads, some controls will need to be put into place so things don't get too out of hand. It is very simple in Go to launch 1,000 goroutines to scrape 1,000 pages simultaneously from a single program. However, your machine most likely cannot handle the same load. The...

Summary

In this chapter, we discussed many topics surrounding concurrency in web scraping. We looked at what concurrency is and how we can benefit from it. We reviewed some of the common issues that must be avoided when building concurrent programs. We also learned about the Go concurrency model and how to use its primitive objects to build concurrent Go applications. Finally, we looked at a few of the niceties Go has included in its sync package. In our final chapter, we will look at taking our scraper to the highest level.
