Hands-On Software Engineering with Golang

Product type: Book
Published: Jan 2020
Publisher: Packt
ISBN-13: 9781838554491
Pages: 640
Edition: 1st
Author: Achilleas Anagnostopoulos

Table of Contents (21 chapters)

Preface
Section 1: Software Engineering and the Software Development Life Cycle
  1. A Bird's-Eye View of Software Engineering
Section 2: Best Practices for Maintainable and Testable Go Code
  2. Best Practices for Writing Clean and Maintainable Go Code
  3. Dependency Management
  4. The Art of Testing
Section 3: Designing and Building a Multi-Tier System from Scratch
  5. The Links 'R' Us Project
  6. Building a Persistence Layer
  7. Data-Processing Pipelines
  8. Graph-Based Data Processing
  9. Communicating with the Outside World
  10. Building, Packaging, and Deploying Software
Section 4: Scaling Out to Handle a Growing Number of Users
  11. Splitting Monoliths into Microservices
  12. Building Distributed Graph-Processing Systems
  13. Metrics Collection and Visualization
Epilogue
Assessments
Other Books You May Enjoy

Data-Processing Pipelines

"Inside every well-written large program is a well-written small program."
- Tony Hoare

Pipelines are a standard and widely used way to break the processing of data into multiple stages. In this chapter, we will explore the basic principles behind data-processing pipelines and present a blueprint for implementing generic, concurrent-safe, and reusable pipelines using Go primitives such as channels, contexts, and goroutines.

In this chapter, you will learn about the following:

  • Designing a generic processing pipeline from scratch using Go primitives
  • Approaches to modeling pipeline payloads in a generic way
  • Strategies for dealing with errors that can occur while a pipeline is executing
  • Pros and cons of synchronous and asynchronous pipeline design
  • Applying pipeline design concepts to building the Links 'R' Us crawler component...

Technical requirements

The full code for the topics discussed in this chapter has been published to this book's GitHub repository under the Chapter07 folder.

You can access the GitHub repository that contains the code and all required resources for each of this book's chapters by going to https://github.com/PacktPublishing/Hands-On-Software-Engineering-with-Golang.

To get you up and running as quickly as possible, each example project includes a Makefile that defines the following set of targets:

  Makefile target   Description
  deps              Install any required dependencies
  test              Run all tests and report coverage
  lint              Check for lint errors
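A Makefile with these targets could look something like the following sketch. The recipe commands (for example, the choice of `golangci-lint` as the linter) are assumptions for illustration and may differ from the actual Makefiles in the book's repository:

```makefile
.PHONY: deps test lint

deps: # Install any required dependencies
	go mod download

test: # Run all tests and report coverage
	go test -race -cover ./...

lint: # Check for lint errors (assumes golangci-lint is installed)
	golangci-lint run ./...
```

Running `make deps test` from a chapter folder would then fetch dependencies and execute the test suite in one step.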

As with all other book chapters, you will need a fairly recent version of Go, which you can download at https://golang.org/dl.

Building a generic data-processing pipeline in Go

The following figure illustrates the high-level design of the pipeline that we will be building throughout the first half of this chapter:

Figure 1: A generic, multistage pipeline

Keep in mind that this is definitely not the only, or necessarily the best, way to go about implementing a data-processing pipeline. Pipelines are inherently application specific, so there is not really a one-size-fits-all guide for constructing efficient pipelines.

Having said that, the proposed design is applicable to a wide variety of use cases, including, but not limited to, the crawler component for the Links 'R' Us project. Let's examine the preceding figure in a bit more detail and identify the basic components that the pipeline comprises:

  • The input source: Inputs essentially function as data sources that pump data into the pipeline...

Building a crawler pipeline for the Links 'R' Us project

In the following sections, we will be putting the generic pipeline package that we built to the test by using it to construct the crawler pipeline for the Links 'R' Us project!

Following the single-responsibility principle, we will break down the crawl task into a sequence of smaller subtasks and assemble the pipeline illustrated in the following figure. The decomposition into smaller subtasks also comes with the benefit that each stage processor can be tested in total isolation without the need to create a pipeline instance:

Figure 2: The stages of the crawler pipeline that we will be constructing

The full code for the crawler and its tests can be found in the Chapter07/crawler package, which you can find at the book's GitHub repository.

...

Summary

In this chapter, we built our own generic, extensible pipeline package from scratch using nothing more than basic Go primitives. We analyzed and implemented different strategies (FIFO, fixed/dynamic worker pools, and broadcasting) for processing data throughout the various stages of our pipeline. In the last part of the chapter, we applied everything that we had learned so far to implement a multistage crawler pipeline for the Links 'R' Us project.

In summary, pipelines provide an elegant solution for breaking down complex data-processing tasks into smaller, easier-to-test steps that can be executed in parallel to make better use of the compute resources at your disposal. In the next chapter, we are going to take a look at a different paradigm for processing data that is organized as a graph.

...

Questions

  1. Why is it considered an antipattern to use interface{} values as arguments to functions and methods?
  2. You are trying to design and build a complex data-processing pipeline that requires copious amounts of computing power (for example, face recognition, audio transcription, or similar). However, when you try to run it on your local machine, you realize that the resource requirements for some of the stages exceed the ones that are currently available locally. Describe how you could modify your current pipeline setup so that you could still run the pipeline on your machine, but arrange for some parts of the pipeline to execute on a remote server that you control.
  3. Describe how you would apply the decorator pattern to log errors returned by the processor functions that you have attached to a pipeline.
  4. What are the key differences between a synchronous and an asynchronous...

Further reading

  1. Berners-Lee, T.; Fielding, R.; Masinter, L.: RFC 3986, Uniform Resource Identifier (URI): Generic Syntax.
  2. bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user-generated content of XSS: https://github.com/microcosm-cc/bluemonday
  3. Documentation for the Go pprof package: https://golang.org/pkg/runtime/pprof
  4. Documentation for the Pool type in the sync package: https://golang.org/pkg/sync/#Pool
  5. gomock: a mocking framework for the Go programming language: https://github.com/golang/mock
  6. go-multierror: a Go (golang) package for representing a list of errors as a single error: https://github.com/hashicorp/go-multierror
  7. Moskowitz, R.; Karrenberg, D.; Rekhter, Y.; Lear, E.; de Groot, G. J.: Address Allocation for Private Internets.
  8. The Go blog: profiling Go programs: https://blog.golang.org/profiling...