Chapter 8: Understanding How Runners Execute Pipelines

So far in this book, we have focused on Apache Beam from the user's perspective. We have seen how to code pipelines in the Java Software Development Kit (SDK), how to use Domain-Specific Languages (DSLs) such as SQL, and how to use portability with the Python SDK. In this chapter, we will focus on how the runner executes the pipeline. This will help us if we want to develop a runner for a new technology, debug our code, or resolve performance issues.

We will not try to implement our own runner in this chapter. Instead, we will focus on the theoretical concepts that underpin runners. We will explore the building blocks of a typical runner, and this will help us understand how a runner executes our user code.

After describing how runners implement the Beam model, we will conclude this chapter with an in-depth description of windowing semantics and of how to use metrics for observability. Improving observability is key when attempting...

Describing the anatomy of an Apache Beam runner

Let's first take a look at the typical life cycle of a pipeline, from construction time to pipeline teardown. The complete life cycle is illustrated in the following figure:

Figure 8.1 – The execution of a pipeline by a runner

The pipeline construction is already well known – we spent most of this book showing how to construct and test pipelines. The next step is submitting the pipeline to a runner. This is the point where the pipeline crosses the SDK-runner boundary, typically by a call to Pipeline.run().
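To make that boundary concrete, here is a minimal sketch using the Java SDK: everything before the run() call happens purely at construction time on the client, and the call itself hands the pipeline over to whichever runner the supplied options select.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class SubmitPipeline {
  public static void main(String[] args) {
    // Parse command-line options; --runner=... selects which runner executes the pipeline.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

    // Pipeline construction - no runner is involved yet.
    Pipeline pipeline = Pipeline.create(options);
    pipeline
        .apply(Create.of("a", "b", "c"))
        .apply(MapElements.into(TypeDescriptors.strings()).via((String s) -> s.toUpperCase()));

    // Crossing the SDK-runner boundary: the pipeline is submitted to the runner.
    pipeline.run().waitUntilFinish();
  }
}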

After the pipeline is submitted to the runner, the runner proceeds as follows:

  1. Once a runner receives a pipeline, it first performs pipeline validation. This consists of various runner-independent validations – for instance, validating that an appropriate window function and/or trigger is set, depending on the boundedness of the pipeline's inputs. These...
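As an illustration of this kind of check, the Java SDK already rejects, at construction time, a GroupByKey applied to an unbounded input that sits in the global window with the default trigger – such a grouping could never emit a result. The sketch below shows the shape of the problem (the exact exception and message may differ between Beam versions):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class InvalidGroupByKey {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // An unbounded source left in the (default) global window...
    PCollection<KV<Integer, Long>> keyed = pipeline
        .apply(GenerateSequence.from(0))   // unbounded PCollection<Long>
        .apply(WithKeys.of(1));            // attach a constant key

    // ...cannot be grouped without a window function or trigger that bounds
    // the amount of buffered data; this line fails with an IllegalStateException
    // before any runner starts executing the pipeline.
    keyed.apply(GroupByKey.create());
  }
}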

Explaining the differences between classic and portable runners

The description in the previous section – Describing the anatomy of an Apache Beam runner – applies to both classic and portable runners. However, there are some important differences between the two.

A classic runner is a runner implemented in the same programming language as the Beam SDK it executes, and it is able to run pipelines written in that specific SDK only. An example of a classic runner is the classic FlinkRunner, which builds on Apache Flink – a technology whose native API is implemented in Java – and can therefore execute Beam pipelines written in the Java SDK. We used this runner throughout the first five chapters of this book.
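Which classic runner executes a pipeline is typically selected through PipelineOptions. A minimal sketch, assuming the Flink runner artifact is on the classpath (the same effect can be achieved with the --runner=FlinkRunner command-line flag):

import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunOnFlink {
  public static void main(String[] args) {
    // Select the classic FlinkRunner programmatically.
    FlinkPipelineOptions options = PipelineOptionsFactory.as(FlinkPipelineOptions.class);
    options.setRunner(FlinkRunner.class);

    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline as usual, then hand it over to the classic runner:
    pipeline.run();
  }
}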

A portable runner is implemented using the portability layer and as a result, it can be used to execute pipelines written in any SDK that is supported by Beam. However, this flexibility comes at a price – a portable runner implemented in Java and running a pipeline implemented...
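For comparison, submitting through the portability layer goes via a job server rather than directly to the runner. A hedged sketch, assuming a Beam job server (for example, the Flink job server) is already running on localhost:8099 and the portability runner artifact is on the classpath:

import org.apache.beam.runners.portability.PortableRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.PortablePipelineOptions;

public class RunPortably {
  public static void main(String[] args) {
    PortablePipelineOptions options =
        PipelineOptionsFactory.as(PortablePipelineOptions.class);
    options.setRunner(PortableRunner.class);
    options.setJobEndpoint("localhost:8099");  // address of the job server

    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline; run() sends it to the job server, which
    // translates it for the underlying engine and manages the SDK harnesses.
    pipeline.run();
  }
}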

Understanding how a runner handles state

As we already know, any complex computation needs to group multiple data elements together. Because stream processing cannot rely on sources being able to replay data (as opposed to pure batch processing, where this property is essential), any updates to the local state during the computation have to be fault-tolerant, and it is the responsibility of the runner to ensure this. The Beam state API is designed precisely to enable this. Any state access is handled by a runner-provided implementation of StateInternals (and TimerInternals for timers – in this discussion, we will treat timers as a special case of state, so we will not describe them separately). The StateInternals instances are responsible for creating the accessors for the state – for example, ValueState, BagState, MapState, and so on. The runner must create and manage these instances to ensure both fault tolerance and consistency...
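From the user's perspective, such state is declared on a DoFn, and the runner backs each declared state cell with its StateInternals implementation. A minimal sketch of a stateful DoFn that counts elements per key (standard Java SDK constructs; the cell name "count" is arbitrary):

import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Counts how many elements have been seen per key; persisting the "count"
// cell fault-tolerantly is the runner's job.
class CountPerKey extends DoFn<KV<String, Long>, KV<String, Long>> {

  @StateId("count")
  private final StateSpec<ValueState<Long>> countSpec = StateSpecs.value();

  @ProcessElement
  public void processElement(
      ProcessContext context, @StateId("count") ValueState<Long> count) {
    Long stored = count.read();                 // served by the runner's StateInternals
    long updated = (stored == null ? 0L : stored) + 1;
    count.write(updated);
    context.output(KV.of(context.element().getKey(), updated));
  }
}

Such a DoFn has to be applied to a keyed PCollection, because the runner partitions and recovers the state per key (and per window).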

Exploring the Apache Beam capability matrix

Beam verifies that runners adhere to the Beam model so that pipelines remain portable between different runners. There are tests specifically designed to validate this compatibility – these tests are called ValidatesRunner suites and are annotated using the @ValidatesRunner annotation. Any runner author can verify the compatibility of their runner against these tests. However, because runners are developed at different paces, it is possible that a specific runner, at a certain moment in time, does not support all of the features of the Beam model. Therefore, the complete set of ValidatesRunner tests can be narrowed down by toggling specific features of the model. The tests in the complete suite are annotated with the features they exercise, such as UsesSetState or UsesSchema, and a runner is then allowed to specify features that should be excluded from the tests. These excluded features should then match what is documented in the...
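In practice, these suites are ordinary JUnit tests grouped by category. The following hypothetical test sketches the pattern: the ValidatesRunner category marks it as part of the compatibility suite, and tests that exercise optional model features carry additional categories so a runner can exclude them.

import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.testing.ValidatesRunner;
import org.apache.beam.sdk.transforms.Create;
import org.junit.Rule;
import org.junit.Test;
import org.junit.experimental.categories.Category;

public class SimpleValidatesRunnerTest {

  @Rule public final transient TestPipeline pipeline = TestPipeline.create();

  // Tests that exercise optional model features carry additional categories
  // (for example, UsesSetState or UsesSchema) so a runner can exclude them.
  @Test
  @Category(ValidatesRunner.class)
  public void testCreateProducesAllElements() {
    PAssert.that(pipeline.apply(Create.of(1, 2, 3))).containsInAnyOrder(1, 2, 3);
    pipeline.run().waitUntilFinish();
  }
}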

Understanding windowing semantics in depth

In Chapter 1, Introducing Data Processing with Apache Beam, we introduced the basic types of window functions. To recap, we defined the following:

  • Fixed windows
  • Sliding windows
  • Global window
  • Session windows

We also defined two basic types of windows: key-aligned and key-unaligned. The first three types (fixed, sliding, and global) are key-aligned, while session windows are key-unaligned (with session windows, each window can start and end at a different time for each key). However, what we skipped in Chapter 1, Introducing Data Processing with Apache Beam, was the fact that we can define completely custom windowing logic.

The Window.into transform accepts a generic WindowFn instance, which defines the following main methods:

  1. The assignWindows method, which assigns each element to a set of window labels.
  2. The isNonMerging method, which tells the runner whether the WindowFn instance defines merging...
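To sketch what custom windowing logic can look like, the following hypothetical WindowFn assigns every element to a ten-minute window derived from its timestamp. It extends PartitioningWindowFn, a convenience base class for non-merging window functions in the Java SDK, so only a handful of methods need to be implemented (treat this as a sketch – exact signatures may vary slightly between Beam versions):

import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.apache.beam.sdk.transforms.windowing.PartitioningWindowFn;
import org.apache.beam.sdk.transforms.windowing.WindowFn;
import org.joda.time.Duration;
import org.joda.time.Instant;

// Hypothetical custom WindowFn: ten-minute windows aligned to the epoch.
// Because PartitioningWindowFn is non-merging, isNonMerging() returns true.
public class TenMinuteWindows extends PartitioningWindowFn<Object, IntervalWindow> {

  private static final Duration SIZE = Duration.standardMinutes(10);

  @Override
  public IntervalWindow assignWindow(Instant timestamp) {
    long startMillis = timestamp.getMillis() - (timestamp.getMillis() % SIZE.getMillis());
    return new IntervalWindow(new Instant(startMillis), new Instant(startMillis + SIZE.getMillis()));
  }

  @Override
  public Coder<IntervalWindow> windowCoder() {
    return IntervalWindow.getCoder();
  }

  @Override
  public boolean isCompatible(WindowFn<?, ?> other) {
    return other instanceof TenMinuteWindows;
  }
}

Such a function would then be applied like any built-in one, via Window.into(new TenMinuteWindows()).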

Debugging pipelines and using Apache Beam metrics for observability

Observability is a key part of spotting potential issues with a running pipeline. It can be used to measure various performance characteristics, including the number of elements processed, the number of RPC calls to backend services, and the distribution of the event-time lags of elements flowing through the pipeline.

Although it should be possible to create a side output for each metric and handle the resulting stream like any other data in the pipeline, the requirement for quick and simple feedback from running pipelines led Beam to create a simple API dedicated to metrics. Currently, Beam supports the following metric types:

  • Counters
  • Gauges
  • Distributions

A Counter instance is a metric represented by a single long value that can only be incremented or decremented (by 1 or by an arbitrary amount).
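A minimal sketch of incrementing a counter from inside a DoFn (the metric namespace and name below are arbitrary choices):

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

class CountingFn extends DoFn<String, String> {

  // A counter is identified by a namespace (here, the class) and a name.
  private final Counter processed = Metrics.counter(CountingFn.class, "processed-elements");

  @ProcessElement
  public void processElement(ProcessContext context) {
    processed.inc();            // increment by 1; inc(n) increments by n
    context.output(context.element());
  }
}

The accumulated values can later be queried from the PipelineResult returned by Pipeline.run(), or collected by whatever monitoring integration the runner provides.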

A Gauge instance is a metric that also holds a single long value; however, this value...

Summary

In this chapter, we walked through how runners execute pipelines, both in classic mode and using the portability layer. We have seen that classic runners are suitable only for cases where a particular underlying technology – for instance, Apache Flink – has an API in the same language as the pipeline SDK. In practice, this mostly means pipelines written in the Java SDK running on runners that are themselves implemented in Java.

In cases where the language of the runner and the pipeline SDK differ, we have to use portability (Fn API), which brings some overhead. We have seen how pipeline fusion is used to reduce this overhead as much as possible. We have also discussed situations where we want to prevent fusion and how to do this by inserting a shuffle boundary.

Next, we discussed the responsibilities of a runner with regard to state management. We saw how the runner ensures fault tolerance and correctness upon failures. We outlined two basic types of fault-tolerant states: local state...
